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Preface 



This volume contains the papers selected for presentation at the First Asia- 
Pacific Conference on Web Intelligence (WI 2001) held in Maebaslri TERRSA, 
Maebaslri City, Japan, October 23-26, 2001. It was sponsored by ACM SIGART 
and Maebaslri Institute of Technology, in cooperation with ACM SIGCHI, ACM 
SIGWEB, the Japanese Society for Artificial Intelligence (JSAI), JSAI SIGFAI, 
JSAI SIGKBS, and IEICE SIGKBSE. The conference was held jointly with the 
Second Asia-Pacific Conference on Intelligent Agent Technology (IAT 2001). 

WI 2001 was the first conference on a new and emerging subfield of computer 
science known as Web Intelligence. It provided an international forum for rese- 
archers and practitioners to present the state of the art in the development of 
Web intelligence, to examine performance characteristics of various approaches 
in Web-based intelligent information technology, and to cross-fertilize ideas on 
the development of Web-based intelligent information systems among different 
domains. By idea-sharing and discussions on the underlying foundations and 
the enabling technologies of Web intelligence, we hoped to stimulate the future 
development of new models, new methodologies, and new tools for building a 
variety of embodiments of Intelligent Web Information Systems (IWIS). 

In spite of its name, WI 2001 was truly an international conference that at- 
tracted 153 full-length research paper submissions from 31 countries and regions 
of all continents. Each submitted paper was reviewed by at least three experts on 
the basis of technical soundness, relevance, originality, significance, and clarity. 
Based on the review reports, 28 regular papers and 45 short papers were accep- 
ted for presentation and publication. Seven technical sessions were organized, 
namely: Web Information System Environment and Foundations; Web Human- 
Media Engineering; Web Information Management; Web Information Retrieval; 
Web Agents; Web Mining and Farming; Web-Based Applications. 

We wish to express our gratitude to all members of the Conference Commit- 
tee and the International Advisory Board for their instrumental and unfailing 
support. WI 2001 had a very exciting program with a number of features, ran- 
ging from technical sessions, invited talks, demos, and social programs. All of this 
work would not have been possible without the generous dedication of the Pro- 
gram Committee members and the external reviewers, of our keynote speakers, 
Edward A. Feigenbaum and Benjamin Wah, and invited speakers, Nick Cercone, 
James Hendler, W. Lewis Johnson, Riichiro Mizoguchi, Prablrakar Raglravan, 
and Patrick S. P. Wang, who prepared and presented very stimulating talks, 
and of Yiming Ye (Demos & Exhibits Chair) who solicited demo proposals and 
set up the program. We thank them for their strong support. 

The conference Web support team at the Knowledge Information Systems 
Laboratory, Maebaslri Institute of Technology did a terrific job of putting to- 
gether and maintaining the home page for the conference as well as building a 
software, cyber-chair , which is an intelligent agent and interface among orga- 
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nizers, program committee members, and authors/ attendees. We would like to 
thank Juzlren Dong, Muneaki Ohsima, and Noriclrika Hayazaki of the conference 
Web support team for their dedication and hard work. 

WI 2001 could not have taken place without the great team effort of the 
Local Organizing Committee and the support of Maebashi Institute of Techno- 
logy and Maebashi Convention Bureau. Our special thanks go to Nobuo Otani 
(Local Organizing Chair), Sean M. Reedy, Masaaki Sakurai, Kanehisa Sekine, 
and Yoshitsugu Kakemoto (the Local Organizing Committee members) for their 
enormous efforts in planning and arranging the logistics of the conference from 
registration/payment handling, venue preparation, accommodation booking, to 
banquet/social program organization. Our sincere gratitude goes to all of the 
authors who submitted papers. We are very grateful to the WI 2001 corporate 
sponsors: Maebashi Convention Bureau, Maebashi City Government, Gunma 
Prefecture Government, The Japan Research Institute, Limited, United States 
Air Force Office of Scientific Research, Asian Office of Aerospace Research and 
Development, United States Army Research Office in Far East, and Web In- 
telligence Laboratory, Inc. for their generous support. Last but not the least, 
we thank Alfred Hofmann of Springer- Verlag for his help in coordinating the 
publication of the proceedings. 

October 2001 

Ning Zhong, Yiyu Yao 
Jiming Liu, Setsuo Ohsuga 
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Abstract. This paper is about a new research field called Web Intelli- 
gence (WI for short). We try to explain the needs for coining the term as 
a sub-discipline of computer science for systematic studies on advanced 
Web related theories and technologies, as well as the design and imple- 
mentation of Intelligent Web Information Systems (IWIS). Background 
information and related topics are discussed in an attempt to demon- 
strate why we consider WI to be a subject worthy of study and, at the 
same time, to establish a starting point for the further development of 
WI. 

1 Introduction 

With the rapid growth of Internet and World Wide Web (WWW), we have now 
entered into a new information age. The Web provides a total new media for 
communication, which goes far beyond the traditional communication medias, 
such as radio, telephone and television. The Web has significant impacts on both 
academic research and ordinary daily life. It revolutionizes the way in which in- 
formation is gathered, stored, processed, presented, shared, and used. The Web 
offers new opportunities and challenges for many areas, such as business, com- 
merce, marketing, finance, publishing, education, research and development. For 
computer scientists, the Web introduces many new research topics and provides 
a new platform to reconsider old problems. It might be high time to create a new 
sub-discipline of computer science covering theories and technologies related to 
the Web. Web Intelligence is our proposal for this purpose. 

The authors of this paper conceived Web Intelligence (WI for short) in late 
1999. We felt that although a number of conferences and journals publish or 
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cover Web or Internet related topics, there was no conference and journal de- 
voted to intelligence aspects in the design and implementation of Web informa- 
tion systems. We suspected that there exists a need for a conference devoted 
to Web Intelligence. At the 24th Annual International Computer Software and 
Applications Conference (IEEE COMPSAC) in 2000, we first introduced Web 
Intelligence and formally announced the new Web Intelligence conference in a 
position paper at a Panel on Data Mining and Web Information Systems m- 
We are impressed by the quick and vast responses, as well as kind support, from 
research community and reputable publishers. 

The main objective of this paper is to formally initiate a sub-discipline of 
computer science by coining the term Web Intelligence, into which Web related 
research can be fitted. It is more a proposal and an appeal for the creation of WI 
on its own rights, rather than a precise definition of what is exactly WI. We are 
more concerned with the necessity and benefits of WI, as well as research topics 
of WI. It is our intention to create further discussion and critical examination 
of WI among researchers working on Web related topics. 

The rest of the paper is organized as follows. In Section 2, we provide a 
definition of Web Intelligence. In Section 3, we argue that it is necessary and 
beneficiary to have a new sub-discipline of computer science labelled by WI. In 
Section 4, we present an overview of Artificial Intelligence and show its relevance 
to WI. In Section 5, we provide a list of topics of WI. In Section 6, we discuss 
trends and challenges of WI related research and development. Section 7 is de- 
voted to intelligent Web Agents (WA). Finally, Section 8 introduces the Web 
Intelligence conference, and Section 9 gives conclusion, respectively. 



2 What Is Web Intelligence? 

At this very early stage, we are not sure if a formal definition of Web Intelligence 
is useful or desirable. Nevertheless, we suggest the following definition: 

“Web Intelligence (WI) exploits Artificial Intelligence (AI) and advanced 
Information Technology (IT) on the Web and Internet.” 

This definition has the following implications. The basis of WI is AI and IT. The 
“I” happens to be shared by both “AI” and “IT” , although with different mean- 
ings in them, and “W” defines the platform on which WI research is carried out. 
The goal of WI is the joint goals of AI and IT on the new platform of the Web. 
That is, WI applies AI and IT for the design and implementation of Intelligent 
Web Information Systems (IWIS). An IWIS should be able to perform functions 
normally associated with human intelligence, such as reasoning, learning, and 
self improvement. 

There perhaps might not be a standard and non-controversial definition of 
WI, as the case that there is no standard definition of AI. One may argued that 
our definition of WI focuses more on the software aspects of the Web. It is not 
our intention to exclude any research topic using the proposed definition. The 
term, Web Intelligence, should be considered as an umbrella or a label of a new 
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branch of research centered on the Web. Our definition simply states the scopes 
and goals of WI. This allows us to include any theories and technologies that 
either fall in the scopes or aim at the same goals. To complement the formal 
definition, we try to make the picture clearer by listing topics to be covered by 
WI. 

WI will be an ever-changing research branch. It will be evolving with devel- 
opment of the Web as new media for information gathering, storage, processing, 
delivery and utilization. It is our expectation that WI will be evolved into an 
inseparable research branch of computer science. Although no one can predict 
the future in detail and without uncertainty, it is clear that WI would have huge 
impacts on the application of computers, which in turn will effect our everyday 
lives. 



3 Motivations and Justifications for WI 

The introduction of Web Intelligence (WI) can be motivated and justified from 
both academic and industrial perspectives. 

Two features of the Web make it a useful and unique platform for computer 
applications and research, the size and complexity. The Web contains a huge 
amount of interconnected Web documents known as Web pages. For example, 
the popular search engine Google claims that it can search 1,346,966,000 pages 
as of February 2001. The sheer size of the Web leads to difficulties in the storage, 
management, and efficient and effective retrieval of Web documents. The com- 
plexity of the Web, in terms of connectivity and diversity of Web documents, 
forces us to reconsider many existing information systems, as well as theories, 
methodologies and technologies underlying those systems. One has to deal with 
a heterogeneous collection of structured, unstructured, semi-structured, inter- 
related, and distributed Web documents consisting of texts, images and sounds, 
instead of homogeneous collection of structured and unrelated objects. The lat- 
ter is the subject of study of many conventional information systems, such as 
databases, information retrieval, and multi-media systems. To accommodate the 
needs of the Web, one needs to study issues on the design and implementation 
of the Web-based information systems by combining and extending results from 
existing intelligent information systems. Existing theories and technologies need 
to be modified or enhanced to deal with complexity of the Web. Although indi- 
vidual Web-based information systems are constantly being deployed, advanced 
issues and techniques for developing and for benefiting from the Web remain 
to be systematically studied. The challenges brought by the Web to computer 
scientists may justify the creation of the new sub-discipline, WI, for carrying out 
Web-related research. 

The Web increases the availability and accessibility of information to a much 
larger community than any other computer applications. The introduction of 
Personal Computers (PCs) brought the computational power to ordinary peo- 
ple. It is the Web that delivers more effectively information to everyone at finger 
tips. The Web, no doubt, offers a new means for sharing and transmitting in- 
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formation unmatchable by other media. The revolution started by the Web is 
just beginning. New business opportunities, such as e-commerce, e-banking, and 
e-publication, will increase with the maturity of the Web. It can hardly over- 
emphasize more impacts of the Web on the business and industrial world. The 
creation of a new sub-discipline devoted to Web related research and applications 
might has a significant value in the future. 

The needs for WI may be further illustrated by the current fast growing 
research and industrial activities centered on it. We searched the Web by using 
the keyword “Web Intelligence” through several search engines in February 2001. 
The results are summarized in Table 1. 



Table 1. A Statistics on WI 



Search Engine 


Number of hits 


Lycos (http://search.lycos.com/) 

Google (http://www.google.com/) 

Excite (http://www.excite.com) 

AltaVista (http:/ /www. AltaVista.com/) 
Netscape (http://Netscape.com/) 

Yahoo (http://www.yahoo.com/) 
LookSmart (http:/ /www.looksmart.com/) 


1,102,279 

1,080,000 

223,825 

1,271 

77 

74 

62 



There are some interesting observations from the search results. The Web 
pages returned by most search engines contain both keywords “Web” and “In- 
telligence”, although they may not appear as a phrase in many pages. The co- 
occurrences of the two keywords show their strong association. This provides 
a piece of convincing empirical evidence supporting WI. The identification of 
this association may lead to the recognition of the importance of WI. We also 
used advanced search option of Google to search for the exact phrase “Web 
Intelligence”. We obtained 3,660 hits. We found that many companies concen- 
trate on WI to provide intelligent solutions to business in the new Web-based 
information age. In fact, the majority of the top 40 pages returned by Google is 
industry related. For comparison, we search Researchlndex (the NECI Scientific 
Literature Digital Library, http://citeseer.nj.nec.com/cs) containing an extreme 
large collection of scientific papers on-line. We found only one paper contains the 
phrase “Web Intelligence”. A further search of “Web” and “Intelligence” within 
two words results in 12 documents. They deal with topics such as Web browser 
intelligence, artificial intelligence for Web search, and Internet marketing intel- 
ligence through Web log mining. We also used “Web Intelligence” to query Ask 
Jeeves (http://www.ask.com/) and obtained related topics, such as intelligent 
Web systems, Web artificial intelligence, Web business intelligence, intelligent 
Web agents, intelligent Web robots, intelligent user interfaces, and Web user 
interfaces. Those topics clearly fit the proposed research areas of WI. From the 
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search results, we also noticed that the Call for Papers of this conference (WI’01) 
was either archived on, or linked by many Web sites. 

In summary, we can conclude that there is an interest and a need for WI. It 
seems that academic research needs to speed up to be in pace with the industrial 
demands. The introduction of WI sub-discipline may be helpful in bridging the 
gap between industry demands and academic research. 



4 Perspectives of WI 

As a new branch of research, Web Intelligence exploits Artificial Intelligence 
(AI) and Information Technology (IT) on the Web. On the one hand, it may 
be viewed as applying results from these existing disciplines to a totally new 
domain. On the other hand, WI may also introduce new problems and challenges 
to the established disciplines. WI may also be viewed as an enhancement or an 
extension of AI and IT. It remains to be seen if WI would become a sub-area 
of AI and IT or a child of a successful marriage of AI and IT. However, no 
matter what happens, studies on WI can benefit a great deal from the results, 
experience, success and lessons of AI and IT. 

In their very popular textbook, Russell and Norvig m examined different 
definitions of artificial intelligence from eight other textbooks, in order to decide 
what is exactly AI. They observed that the definitions vary along the two dimen- 
sions. One dimension deals with the functionality and ability of an AI system, 
ranging from thought processes and reasoning ability of the systems to the be- 
havior of the systems. The other dimension deals with the designing philosophy 
of AI systems, ranging from intimating human problem solving to making ratio- 
nal decision. The combination of the two dimensions results in four categories 
of AI systems adopted from Russell and Norvig |^H] : 



Systems that think like humans. 


Systems that think rationally. 


Systems that act like humans. 


Systems that act rationally. 



This classification provides a basis for the studies of various views and approaches 
for AI. It also clearly defines goals in the design of AI systems. According to Rus- 
sell and Norvig EH], they correspond to four approaches, the cognitive modeling 
approach (thinking humanly), the Turing test approach (acting humanly), the 
the laws of thought approach (thinking rationally), and the rational agent ap- 
proach (acting rationally). 

The two rows for separating AI systems in terms of thinking and acting may 
not be a most suitable classification. Action is normally the final result of a 
thinking process. One may argue that the class of systems acting humanly is a 
super set of the class of system thinking humanly. In contrast, the separation of 
human-centered approach and rationality-centered approach may have a signif- 
icant implications in the studies of AI. While earlier research on AI was focus 
more on human-centered approach, rationality-centered approach received more 
attention recently m- 
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The first column is centered around humans and leads to the treatment of 
AI as an empirical science involving hypothesis and experimental confirmation. 
A human-centered approach represents the descriptive view of AI. Under this 
view, a system is designed by intimating the human problem solving. This im- 
plies that a system should have the usual human capabilities such as knowledge 
representation, natural language processing, reasoning, planning and learning. 
The performance of an AI system is measured or evaluated through the Turing 
test. An system is said to be intelligent if it provides human level performance. 
Such a descriptive view dominates the majority of earlier studies of expert sys- 
tems, a special type of AI systems. 

The second column represents the prescriptive or normative view of AI. It 
deals with theoretical principles and laws that an AI system must follow, in- 
stead of intimating humans. That is, a rationalist approach deals with an ideal 
concept of intelligence, which may be independent of human problem solving. 
An AI system is rational if it does the right thing and makes the right deci- 
sion. The normative view of AI based on the well established disciplines such as 
mathematics, logic, and engineering. 

The descriptive and normative views also reflect the experimental and theo- 
retical aspects of AI research. The experimental study represents the descriptive 
view. It covers theories and models for the explanation of the workings of the 
human mind, and applications of AI to solving problems that normally require 
human intelligence. The theoretic study aims at the development of theories of 
rationality, and focuses on the foundations of AI. The two views are complemen- 
tary to each other. Studies in one direction may provide valuable insights into 
the other. 

Web Intelligence concerns the design and development of intelligent Web 
information systems. The previous framework for the study of AI can be im- 
mediately applied to that of Web Intelligence. More specifically, we can cluster 
research in WI into the prescriptive approach and the normative approach, and 
cluster Web information systems in terms of thinking and acting. Various re- 
search topics can be identified and grouped accordingly. 

Like AI, a foundation of WI can be established by drawing results from the 
following many related disciplines: 

Mathematics: 

computation, logic, probability. 

Applied Mathematics and Statistics: 

algorithms, non-classical logics, decision theory, information theory, mea- 
surement theory, utility theory, theories of uncertainty, approximate reason- 
ing. 

Psychology: 

cognitive psychology, cognitive science, human-machine interaction, user in- 
terface. 

Linguistics: 

computational linguistics, natural language processing, machine translation. 

Information Technology: 

information science, databases, information retrieval systems, knowledge dis- 
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covery and data mining, expert systems, knowledge-based systems, decision 
support systems, intelligent information agents. 

The topics under each entry are only intended as examples. They do not form 
an exhausted list. 

In the development of AI, we have witnessed the formulation of many of its 
new sub-branches, such as knowledge-based systems, artificial neural networks, 
genetic algorithms, and intelligent agents. Recently, non-classical AI topics have 
received much attentions under the name of computational intelligence. Compu- 
tational intelligence focuses on the computational aspect of intelligent systems jj : 
1521 . The application of AI in other disciplines also leads to new techniques in 
the corresponding fields. For instance, Business Intelligence (BI) is a result of 
applying artificial intelligence to the business domain. Artificial Intelligence in 
Medicine also proved to be a successful application. When viewing WI in such 
settings, we can identify at least two of its roles. WI may be interpreted “Web 
based Artificial Intelligence” as the study of particular aspects of AI in the con- 
text of the Web, in parallel to the study of computational intelligence. WI may 
also be interpreted as “Artificial Intelligence on the Web” which regards it as a 
new application of AI. 

A more practical goal of WI is the design and implementation of intelligent 
Web information systems (IWIS). It should be realized that an IWIS is an in- 
tegrated system containing many sub-systems. To design such a system, it is 
necessary to apply a variety of theories and technologies. 

In his work on vision, Marr m convincingly made the point that a full 
understanding of an intelligent system involves explanations at various levels. 
The same argument is applicable to the development of an IWIS. We can iden- 
tify at least two levels, the conceptual formulation and physical implementation. 
The conceptual formulation deals with foundations of IWIS, while physical im- 
plementation concerns with construction of an IWIS. The former depends on 
mathematics and logic, and the latter depends on algorithms and programming. 
Each level may be further divided into more sub-levels. Research in WI should 
include any topics at different levels. 



5 Topics Covered by WI 

In order to study advanced Web technology systematically, and develop advanced 
Web-based intelligent information systems, we list several major subtopics in 
each topic below. 

— Web Information System Environment and Foundations: 

• competitive dynamics of Web sites, 

• emerging Web technology, 

• network community formation and support, 

• new Web information description and query languages, 

• the semantic Web, 

• theories of small world Web, 
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• Web information system development tools, 

• Web protocols. 

Web Human-Media Engineering: 

• the art of Web page design, 

• multimedia information representation, 

• multimedia information processing, 

• visualization of Web information, 

• Web-based human computer interface. 

Web Information Management: 

• data quality management, 

• information transformation, 

• Internet and Web-based data management, 

• multi-dimensional Web databases, 

• OLAP (on-line analytical processing), 

• multimedia information management, 

• new data models for the Web, 

• object oriented Web information management, 

• personalized information management, 

• semi-structured data management, 

• use and management of metadata, 

• Web knowledge management, 

• Web page automatic generation and updating, 

• Web security, integrity, privacy and trust. 

Web Information Retrieval: 

• approximate retrieval, 

• conceptual information extraction, 

• image retrieval, 

• multi-linguistic information retrieval, 

• multimedia retrieval, 

• new retrieval models, 

• ontology-based information retrieval, 

• automatic Web content cataloguing and indexing. 
Web Agents: 

• dynamics of information sources, 

• e-mail filtering, 

• e-mail semi-automatic reply, 

• global information collecting, 

• information filtering, 

• navigation guides, 

• recommender systems, 

• remembrance agents, 

• reputation mechanisms, 

• resource intermediary and coordination mechanisms, 

• Web-based cooperative problem solving. 

Web Mining and Farming: 
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• data mining and knowledge discovery, 

• hypertext analysis and transformation, 

• learning user profiles, 

• multimedia data mining, 

• regularities in Web surfing and Internet congestion, 

• text mining, 

• Web-based ontology engineering, 

• Web-based reverse engineering, 

• Web farming, 

• Web-log mining, 

• Web warehousing. 

— Web-Based Applications: 

• business intelligence, 

• computational societies and markets, 

• conversational systems, 

• customer relationship management (CRM), 

• direct marketing, 

• electronic commerce and electronic business, 

• electronic library, 

• information markets, 

• price dynamics and pricing algorithms, 

• measuring and analyzing Web merchandising, 

• Web-based decision support systems, 

• Web-based distributed information systems, 

• Web-based electronic data interchange (EDI), 

• Web-based learning systems, 

• Web marketing, 

• Web publishing. 

It should be pointed out that WI researches are not limited to the topics 
listed above. We expect that new topics will be added, and existing topic will 
be regrouped or redefined. 

In summary, we can observe two ways in which WI research can be character- 
ized. The first one is by adding “Web” as a prefix to an existing topic. For exam- 
ple, from “digital library”, “information retrieval”, and “agents”, we can obtain 
“Web digital library” , “Web information retrieval” , and “Web agents” . On the 
other hand, we can add “on the Web” as a postfix. For example, we can obtain 
“digital library on the Web”, “information retrieval on the Web”, and “agent 
on the Web”. Our list of research topics is given by the prefix method. How- 
ever, we must avoid mistakes of seductive semantics as discussed by Bezdek pj. 
That is, “words or phrases which convey, by being interpreted in their ordinary 
(non-scientific) usage, a far more profound and substantial meaning about an 
algorithm or computational architecture than can be readily ascertained from 
the available theoretical and/or empirical evidence.” For a healthy development 
of Web Intelligence, we have to be more realistic about our goals and try to 
avoid over-selling of the subject. 
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6 Trends and Challenges of WI Related Research and 
Development 

Web Intelligence presents excellent opportunities and challenges for the research 
and development of new generation Web-based information processing technol- 
ogy, as well as for exploiting business intelligence. With the rapid growth of the 
Web, research and development on WI have received much attention. We expect 
that more attention will be focused on WI in the coming years. Many specific 
applications and systems have been proposed and studied. Several dominant 
trends can be observed and are briefly reviewed in this section. 

E-commerce is one of the most important applications of WI. The e-commerce 
activity that involves the end user is undergoing a significant revolution m- 
The ability to track users’ browsing behavior down to individual mouse clicks 
has brought the vendor and end customer closer than ever before. It is now pos- 
sible for a vendor to personalize his product message for individual customers 
at a massive scale. This is called targeted marketing or direct marketing ; 25 ]- 
Web mining and Web usage analysis play an important role in e-commerce for 
customer relationship management (CRM) and targeted marketing. Web min- 
ing is the use of data mining techniques to automatically discover and extract 
information from Web documents and services KMEE). Zhong et al. proposed 
a way of mining peculiar data and peculiarity rules that can be used for Web-log 
mining m They also proposed ways for targeted marketing by mining classi- 
fication rules and market value functions pirn A challenge is to explore the 
connection between Web mining and the related agent paradigm such as Web 
farming that is the systematic refining of information resources on the Web for 
business intelligence [H. 

Text analysis, retrieval, and Web based digital library is another fruitful 
research area in WI. Topics in this area include semantics model of the Web, 
text ming, automatic construction of citation. Abiteboul et al. systematically 
investigated the data on the Web and the features of semistructured data |I|. 
Zhong et al. studied text mining on the Web including automatic construction 
of ontology, e-mail filtering system, and Web-based e-business systems pip . 

Web based intelligent agents are aimed at improving a Web site or providing 
help to a user. Liu et al. worked on e-commerce agents m Liu and Zhong worked 
on Web agents and KDDA (Knowledge Discovery and Data Mining Agents) jSDJ 
l.'i I j . We believe that Web agents will be a very important issue. It is therefore not 
surprising that we decide to hold the WI conference in parallel to the Intelligent 
Agents conference. In the next section, we provide a more detailed description 
of intelligent Web agents. 

The Web itself has been studied from two aspects, the structure of the Web 
as a graph and the semantics of the Web. Studies on Web structures investi- 
gate several structural properties of graphs arising from the Web, including the 
graph of hyperlinks, and the graph induced by connections between distributed 
search servants. The study of the Web as a graph is not only fascinating in 
its own right, but also yields valuable insight into Web algorithms for crawling, 
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searching and community discovery, and the sociological phenomena which char- 
acterize its evolution [B.<. Studies of the semantics of the Web were initiated by 
Tim Berners-Lee, the creator of the World Wide Web Q|. The Web is referred to 
as the “semantic Web”, where information will be machine-processible in ways 
that support intelligent network services such as information brokers and search 
agents cum The semantic Web requires interoperability standards that ad- 
dress not only the syntactic form of documents but also the semantic content. A 
semantic Web also lets agents utilize all the data on all Web pages, allowing it to 
gain knowledge from one site and apply it to logical mappings on other sites for 
ontology-based Web retrieval and e-business intelligence. Ontologies and agent 
technology can play a crucial role in enabling such Web-based knowledge pro- 
cessing, sharing, and reuse between applications. A new DARPA program called 
DAML (DARPA Agent Markup Languages) is a step toward a “semantic Web” 
where agents, search engines and other programs can read DAML mark-up to 
decipher meaning rather than just the content on a Web site m- 

7 Intelligent Web Agents 

Intelligent agents are computational entities that are capable of making deci- 
sions on behalf of their users and self-improving their performance in dynam- 
ically changing and unpredictable task environments irifil.'lOl.'l ll.'Otl.'ll . In f27 \ . 
Liu provided a comprehensive overview of related research work in the field of 
autonomous agents and multi-agent systems, with an emphasis on its theoreti- 
cal and computational foundations as well as in-depth discussions on the useful 
techniques for developing various embodiments of agent-based systems, such as 
autonomous robots, collective vision and motion, autonomous animation, and 
search and segmentation agents. The core of those techniques is the notion of 
synthetic or emergent autonomy based on behavioral self-organization. 

Intelligent Web Agents (WA) are software programs that primarily serve two 
important roles: a), autonomous entities for exploring and exploiting Web-based 
services, and b). prototype entities for exhibiting and explaining Web-generated 
regularities. These two roles are summarized below. 

7.1 From WA to Web-Based Services 

The first role for WA can be readily described and appreciated by examining the 
following typical scenarios in which various tasks and objectives are achieved §5, 

EHHH3J. 

1. Personalized Multimodal Interface. WA can provide users with a 
user-friendly style of presentation that personalizes both the interaction with 
users and the content presentation. This activity involves the creation of var- 
ious cognitive aids, including tables, charts, executive summaries, indices, 
and personalized visual assistants ( e.g graphically animated personas and 
virtual-reality avatars). WA as interfaces must offer the ease of using elec- 
tronic services. The provided cognitive aids must be concise (i.e., accessible 
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with as fewer manipulations as possible and as less memorization as possible) 
and consistent (i.e., understandable based on users’ previously customized 
cognitive styles). 

2. Push and Pull. WA can play an important role in dynamically creating 
pull-and-push advertising. Here, by pull-and-push advertising we mean that 
a user expresses his or her favorites during the interaction with the agents 
(pull advertising) and in return the agents search and deliver the informa- 
tion about the favorite items dynamically to the user (push advertising). 
Such agents can also increase the positive externality of products, that is, 
the better people are informed about certain products, the more likely the 
products will be sold. 

3. Pattern Discovery and Self-Organization. WA will enable to detect 
what users’ buying patterns are forming and how they are structured, and 
hence effectively manage the online commerce. Collaborative recommenda- 
tion agents can help individual users aggregate into groups, which can in 
turn form a dynamical marketplace (for example, see my 

4. Information Gateway. WA can provide users with immediate access to 
the most relevant information. This support encompasses a wide spectrum of 
information filtering and delivery activities by manipulating various hetero- 
geneous Web sources including databases, data warehouses, newswire, finan- 
cial reports, newsletters, newsgroups, outbound emails, electronic bulletin 
boards, and hypermedia documents, and based on users’ profiles, tailoring 
and delivering the retrieved information to the users. The provided summary 
information must be just-in-time (i.e., delivered whenever is needed), rele- 
vant (i.e., focused on whichever topics the users are concerned with), and 
up-to-minute (i.e., refreshed whenever a new piece of information arrives). 
An example of applications with this type of agent support is comparison 
shopping that utilizes WA with mobile and filtering capabilities. Some re- 
lated experiences have been reported in ms&m . 

5. Reward. WA can motivate users to enter and re-enter a certain electronic 
service. While an ever- greater proliferation of content continues to consume 
individuals’ attention, e.g., through push technology to sell something or to 
support users, WA can play a crucial role in creating a captive audience, 
in educating it constantly, and even in removing away users’ old purchase 
habits. To be rewarding is to add value. The motivational rewards or in- 
centives can be created by offering free access to certain information and 
utility resources (e.g., free software download), opportunities to participate 
in multi-user information/commodity exchange activities (e.g., collaborative 
recommendation, chat, bidding, and auction), and scheduled plans for pro- 
motional deals. 

6. Matchmaking. WA can serve as a new means for trading commodities. 
Since the interests of users as well as the availability of products from dealers 
can change dynamically from time to time, what usually happens in present 
day electronic commerce is: (1) a dealer sells his or her items simply because 
these are the only items that he or she has at the moment, or (2) a user 
buys a certain item simply because it is the last item that he or she can find 
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that partially fits his or her need. WA-based customized business attempts to 
change the existing online buying and selling into the following new scenarios: 
(1) a dealer identifies and offers what exactly users are interested in, and (2) 
a user finds and purchases what he or she really loves - some technical issues 
related to matchmaking have been addressed in BG2E2- 

7. Decision. WA can assist Web users in making decisions. Such decision sup- 
port may be in the forms of evaluations or recommendations on the various 
features of certain specific items, cost-benefit analysis, inference support for 
optimizing utility and resources with respect to functional, time, and cost 
requirements, and model-based trend analysis and projections concerning 
new patterns of demand (for example, see (19|39lj ). 

8. Delegation. WA can act on behalf of Web users in online activities. The 
tasks that WA may delegate to achieve include matchmaking, server mon- 
itoring, negotiation, bidding, auction, transaction, transfer of goods, and 
follow-up support. This scenario will empower a new paradigm shift from 
user-centric to user-delegated electronic business. The delegations of these 
tasks may be carried out in either semi-autonomous (with users’ intervention 
on decisions) or fully autonomous manners. To this end, various computa- 
tional theories and models have been proposed and reported in para sa 

E2))- 

9. Collaborative Work Support. WA can offer the infrastructure support 
as well as the necessary function for collaboratively solving problems and 
managing workflow activities (for related examples, see IIUII9l'l2| l. 

7.2 From WA to Web-Generated Regularities 

The World Wide Web has evolved into a dynamic, distributed, heterogeneous, 
complex network, which is hard to control EEDj. To many people, whether Web 
developers or researchers who are concerned about the dynamics of complex 
systems E3 such as Internet, human community, and ecology, it has become 
imperative to truly understand and interpret (in addition to merely observe) the 
strong regularities emerged from the ‘messy’ universe of the World Wide Web. 
Up till now, there have been few efforts on describing different aspects of the 
orders in the World Wide Web pnrerrc] . However, as an entire system, the origin 
and interrelated elements of the regularities still remain unknown. 

Liu and Zhang m have designed and validated an agent-based model that 
takes into account Web topology, information distribution, and user interest 
profile to simulate user surfing behavior and explore the origin of regularities on 
the World Wide Web surfing. In their experiments, they have discovered that it 
is the unique distribution of user interest that leads to the regularities in user 
surfing behavior, i.e., a power law distribution of user surfing depth. The Web 
topology can only influence the shape parameters of the distribution without 
changing the nature of the distribution. Also discovered is that the power law 
of link click frequency is largely due to user purposeful surfing behavior. Their 
work shows that the regularities in the Web are interrelated and not artifacts of 
a particular surfing process. 
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Also in their studies, they have studied three categories of users, according 
to their interest and familiarity with the Web: Random users who have no ob- 
vious intention in Web surfing, rational users who have certain goals to achieve 
but are not familiar with the Web structure, recurrent users who have certain 
specific intents and are very familiar with the Web structure. The ability to 
predict the content at the next-level nodes becomes stronger when moving from 
random to recurrent users. The result of simulations with respect to the three 
user categories unveiled that the regularities of user surfing depth on pages and 
domains still remain the same, while a power law of link click frequency distri- 
bution will disappear as we move from recurrent users to random users. This 
result shows that the order existing in link click frequency comes from user’s 
content-prediction ability, that is whether or not a user can determine his/her 
next step according to his/her own interest and names of the hyperlinks. 



8 Web Intelligence Conference 

In order to meet the challenges of WI in the new information age, a new 
high-quality, higlr-impact international conference series, namely the Asia-Pacific 
Conference on Web Intelligence (WI) is initiated. WI-2001 is the first meeting in 
this new series (http://kis.maebashi-it.ac.jp/wi01). It is an international forum 
for researchers and practitioners to present the state-of-the-art in the develop- 
ment of Web intelligence, to examine performance characteristics of various ap- 
proaches in Web-based intelligent information technology, and to cross-fertilize 
ideas on the development of Web-based intelligent information systems among 
different domains. By idea-sharing and discussions on the underlying foundations 
and the enabling technologies of Web intelligence, we hope to stimulate future 
development of new models, new methodologies, and new tools for building a va- 
riety of embodiments of Intelligent Web Information Systems. By jointly holding 
WI conference and Intelligent Agents conference, we expect a close interaction 
between the two groups. 

The title “Web Intelligence” of the conference was chosen to reflect the dis- 
tinct feature that the conference is focused on intelligence aspects of Web and 
Web information systems. The name is short enough to catch attention to this 
important subheld. It is also general enough to attract contributions from all 
Web related research. 



9 Conclusion 

While it may be difficult to define what is exactly Web Intelligence (WI), one 
can easily argue for the need and necessity of creating such a subheld of study in 
computer science. With the rapid growth of the Web, we foresee a fast growing 
interest in Web Intelligence. 

Roughly speaking, we define Web Intelligence as a held that “exploits Arti- 
ficial Intelligence (AI) and advanced Information Technology (IT) on the Web 
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and Internet.” It may be viewed as a marriage of artificial intelligence and infor- 
mation technology in the new setting of the Web. By examining the scope and 
historical development of artificial intelligence, we discuss some fundamental is- 
sues of Web Intelligence in a similar manner. There is no doubt in our mind that 
results from AI and IT will influence the development of WI. 

Instead of searching for a precise and non-controversial definition of WI, we 
list topics that might be interested by a researcher working on Web related issues. 
In particular, we identify some challenging issues of WI, including e-commerce, 
studies of Web structures and Web semantics, Web information storage and 
retrieval, Web mining, and intelligent Web agents. 

We advocate for a new conference devoted to WI, namely, the Asia-Pacific 
Conference on Web Intelligence. The conference will be an international forum 
for researchers and practitioners to present the state-of-the-art in the develop- 
ment of Web intelligence, to examine performance characteristics of various ap- 
proaches in Web-based intelligent information technology, and to cross-fertilize 
ideas on the development of Web-based intelligent information systems among 
different domains. 
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Abstract. Good science periodically revisits old results in the context of new 
discoveries and technologies. In this way, new understanding is gained of the 
earlier results and, sometimes, new insights can be gained into current work. 
This in turn can lead to new discoveries, and so the process continues. In this 
paper, we revisit the generalization known as the “knowledge principle,” intro- 
duced more than twenty years ago to explain the source of power of expert sys- 
tems. We show that in a new context, the power of knowledge will come from 
the distribution and decentralization of knowledge that is ubiquitously devel- 
oped and applied. In the new semantic web concept, tools are provided to the 
large population of WWW users that allow those individuals (perhaps millions 
of them) to encode small bodies of knowledge that can be integrated into an ef- 
fective large knowledge base. The metaphor of “knowledge is power” thus 
changes from one of the centralized power to one of distributed power. 



1 Introduction 

In 1963, Feigenbaum and Feldman [1] edited the first major collection of papers in the 
Artificial Intelligence (AI) field. The book divided AI into two big parts. One part, 
called “Artificial Intelligence”, had an engineering motivation to construct intelligent 
computer programs, but was not concerned about whether the programs behaved in the 
same way that people behave intelligently. The second part, called “Simulation of 
Cognitive Processes,” reported work that had a motivation in Psychology, where the 
concern was to model accurately the processes that people used to behave intelli- 
gently. Both kinds of work have been present in the field over the past four decades. 
The work has brought great changes to both information technology and to psychol- 
ogy- 
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2 Previous Results about Knowledge in AI Systems 

The application of artificial intelligence led to a number of areas of AI research. Ex- 
pert Systems made the transition from university research to industry in the late 1970s 
and early 1980s. These programs generalized the approach developed in MYCIN and 
other early systems. Most of them used sets of rules about a limited domain to provide 
specific advice or offer diagnoses of the malfunction of complex devices and proc- 
esses. A key insight from the work of the past forty years is that computers have been 
able to achieve human-like behavior at world-class levels of performance, but only in 
limited domains. One of the most impressive demonstrations of this occurred in 1997, 
when the computer Deep Blue beat the world chess champion, Gary Kasparov. Kas- 
parov remarked in an interview that the computer move that upset him most was “al- 
most human.” Of course the subject matter of the chess game was not general, and the 
discourse was stylized to chess notation, a specific jargon of the chess domain. Deep 
Blue is unique only in that it arrived thirty years later than predicted famously by 
Professor Herbert Simon (one of the founders of AI) and was therefore eagerly 
awaited. 



Much earlier however, other AI programs were already behaving at world-class levels 
of performance. For example, as early as the mid-1970s, a computer program at Stan- 
ford University — the MYCIN program — was interacting with physicians about blood 
infections, asking questions about patient signs and symptoms, offering diagnoses, 
recommending antibiotic therapy, explaining both diagnosis and therapy recommen- 
dation in the stylized English and jargon that doctors speak to each other. MYCIN’s 
behavior was evaluated in a test that involved judgments by dozens of infectious dis- 
ease experts around the country. Its behavior was formally evaluated as essentially the 
same as that of the best doctors. MYCIN was an early example of a type of AI pro- 
gram called Expert Systems, so called to indicate that their behavior ranked them with 
experts in their field. 



3 Commercialization of AI as Expert Systems 

The application of artificial intelligence led to a number of areas of AI research. Ex- 
pert Systems made the transition from university research to industry in the late 1970s 
and early 1980s. These programs generalized the approach developed in MYCIN and 
other early systems. Most of them used sets of rules about a limited domain to provide 
specific advice or offer diagnoses of the malfunction of complex devices and proc- 
esses. 

Expert Systems used reasoning processes for inference, setting them apart from most 
other computer programs, which used arithmetic processes for calculation. The rea- 
soning of expert systems was typically done with some form of logic. The program 
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generally reasoned from collections (called knowledge bases) of knowledge acquired 
from people who were expert in the domain. 

Knowledge Bases (KB) contained a combination of: 

• Facts (a set of ground assertions that were assumed to be true in some domain), and 

• Heuristics (rules of good judgment, rules of “good guessing’’, acceptable shortcuts), 
which could be used to combine these facts in ways that provided inferential power. 

This knowledge was typically represented as logic expressions or logic production 
rules (IF... THEN...), and a program, often called an inference engine, was used to 
manage the process of combining rules and facts to reach conclusions. 

Expert System technology has been absorbed into the mainstream of commercial 
software development. In practical applications, they are often embedded into large 
software systems. For example, an equipment configuration expert system is embed- 
ded within a larger software system for helping salespeople sell complex equipment 
(“Salesbuilder” from Trilogy). 

Tens of thousands of Expert Systems were built worldwide during the past twenty 
years. As well as those mentioned above, other notable cases include: 

• Shopping advisors on the WWW 

• Microsoft hardware troubleshooting wizards and user-help wizards. 

• Tools for diagnosis of system failures and equipment failures 

• Decision aids for financial analysis and transactions 

• Tools for scheduling and planning of operations and manufacturing logistics 

In universities, research into new technologies for expert reasoning is also being pur- 
sued. On important technology is used for reasoning directly about probabilities and 
uncertainties using decision theory. There are hybrids of logic and probabilistic rea- 
soning (called Bayes nets or belief nets). The technology called “qualitative physics 
for device modeling” seeks methods for symbolic modeling of the physics and engi- 
neering of devices. Perhaps the best example of this was the experimental system, 
deployed on the NASA spacecraft Deep Space 1, that was allowed to control the entire 
spacecraft autonomously for a period of several days. 

What powered the Expert Systems to expert-level performance? 

In all of these systems, the most important part of the software is the Knowledge Base 
— the model of human expertise in the domain of discourse captured in a form suitable 
for reasoning by programs. Thus, the successful construction of an Expert System is 
an exercise in expertise modeling, and as such it is much more an exercise in episte- 
mology than technology. This work is quintessentially (in fact, almost by definition) 
interdisciplinary. For every system built, it was the knowledge from the domain of 
discourse, knowledge outside of Computer Science, that allowed the program to be- 
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have at expert levels of performance. In each case, the knowledge deployed was more 
important than the logical or probabilistic reasoning method employed. 



4 The Knowledge Principle 

One can view the many Expert System developments as a large number of experi- 
ments in the building of individual specialized AIs, each tailored to a particular do- 
main. Given this view, what principles can be extracted from this body of experi- 
mentation? 

The most important scientific generalization is that the power of an AI program to 
behave at high levels of intelligent performance lies in a systematic and extensive 
encoding of the knowledge of experts of the domain of performance. 

This is known as the knowledge-is-power hypothesis or the knowledge principle [2], 
Stated succinctly, this principle states that 

Knowledge is the primary source of the intellectual power of intelligent agents, 
both human and computer. 

Knowledge supplies the semantic basis for performance in all tasks requiring under- 
standing. That includes natural language understanding or any kind of situation under- 
standing or sensory understanding. The primary lesson learned from this principle is 
that, practically speaking, there is no shortcut via the elegant logical machinery of AI— 
that is, most of the time one cannot derive what one needs to know for adequate per- 
formance. The intelligent agent (human or computer) must have an encoding of the 
practical knowledge needed to reason about a given domain. 



5 Semantics 

Unfortunately, the knowledge principle comes with a corollary that is difficult for 
many in AI to accept. If we want to have significant levels of computational intelli- 
gence, we need significant bodies of knowledge in knowledge bases. Each domain of 
reasoning for which we want expert performance requires much engineering of the 
knowledge and much time spent in the acquisition of this knowledge from those who 
know the domain well. There is no “magic bullet” - significant performance requires 
significant knowledge engineering and the capture of many domain specific rules. 

There have been other AI approaches explored to overcome this so-called “knowledge 
acquisition bottleneck.” Symbolic machine learning, numerical learning approaches 
(such as those in neural networks) and “evolutionary” computing approaches have all 
been shown to have niches in which they perform well. In most cases, however, these 
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approaches have either not had the impact of the expert systems produced by the usual 
knowledge engineering methods; or they have required significant domain expertise 
for the design of the algorithms, training sets, etc. Nothing to date has weakened the 
knowledge principle and its corollary of the need for significant knowledge engineer- 
ing effort. 

The knowledge acquisition bottleneck becomes particularly troubling in the era of the 
World Wide Web. The ubiquity of (human readable) information accessible at the 
fingertips of millions of users drives a relentless demand for more intelligent computer 
assistance — to find, filter, integrate and visualize computer-retrievable data. If we 
must have a time-consuming and centralized knowledge engineering for each domain 
of discourse, the ubiquity of intelligent computer assistance for WWW users will not 
happen because of the enormity of the effort needed. 



6 The Semantic Web 

In a recent article [3], Berners-Lee, Hendler, and Lasilla discuss how we can get ubiq- 
uitous semantic knowledge out into the world of the WWW in a way that it can be 
harnessed to allow better human access to, and use of, information. This need is being 
forced by the huge size and penetration of the World Wide Web - which drives an 
enormous economic pressure to deliver more meaningful WWW behavior to custom- 
ers and users. Users need what the researchers at Xerox PARC call “sense-making” 
— semantic processes that aid the average computer user in turning information on the 
WWW into human accessible knowledge. 

The knowledge principle applies to this issue. Since knowledge is the basis for the 
semantics and supports all understanding, we need knowledge bases to match the large 
scope of user needs. Thus, the construction of very large knowledge bases, and 
probably the networking of many knowledge bases, is the number one task for AI, for 
the information industry, or possibly for a consortium of universities and non-profits 
organizations. Much actually has been done toward answering this need, but very little 
has been made robust and integrated into practice to date. 



7 Research on Very Large Knowledge Bases 

The idea of building very large knowledge bases is not new. The progenitor of all of 
these efforts is the so-called semantic network, a research topic of the 1960s and 
1970s. One strand of this research, and the one we focus on, was the building, by 
hand, of semantic networks representing very large amounts of knowledge in broad 



areas. 
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7.1 EDR 

In Japan there was a ten-year national research initiative in the 1980s and 1990s called 
Electronic Dictionary Research (EDR) to support research in natural language proc- 
essing (http://www.iijnet.or.jp/edr/). One of EDR’s products was the Concept Dic- 
tionary, a very large and carefully done semantic network (the language was Japanese 
but the concepts of course were largely universal). This excellent product was not 
transferred to commercial use, nor has it been made available in any language other 
than Japanese. 



7.2 CYC 

The major US project to build a very large knowledge base is the CYC project. 
CYC’s goal has been to encode in logical expressions a very large body of common 
sense knowledge. Started by an industrial consortium in 1984, its development is be- 
ing continued today by CYC Corporation. The CYC knowledge base now encodes 
millions of assertions and is clearly the best practical example of a very large knowl- 
edge base generated by a knowledge engineering process. 

CYC’s knowledge is layered from most general to most specific. The most general 
levels contain generic knowledge about such concepts as (these are examples) time, 
space, causality, intention, belief, emotions, substances, physical actions, and human 
capabilities (both physical and mental). There are about half a million assertions of 
this type. 

Other knowledge is somewhat more specialized, e.g. occupations, clothing, human- 
occupied dwellings, and information about specific humans. In addition, CYC con- 
tains a number of facts relating to specialty knowledge. These could be as diverse as 
the engineering of metal parts to how to perform various military maneuvers. CYC 
goes well beyond all other semantic networks in the detail and richness with which the 
knowledge is encoded. 



7.3 Word Net 

A massive semantic net project that is similar in motivation and methodology to the 
Japanese EDR project is Word Net. Word Net was started at Princeton University and 
has been built with the help of many American and European researchers. Word Net is 
a semantic net of terms, synonyms and relationships. To date it is a research tool and, 
like EDR and CYC, has received little application in the information industry. Other 
major projects involving semantic encoding have been done by the US National Li- 
brary of Medicine), by the IEEE, and by NASA. 
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7.4 HPKB and RKF 

The American organization DARPA, The Defense Advanced Research Projects 
Agency, has been sponsoring the two largest American projects of AI research and 
development in recent years: the High-Performance Knowledge Base (HPKB) initia- 
tive and the ongoing Rapid Knowledge Formation (RKF) project 
(http://reliant.teknowledge.com/RKF/). 

In HPKB, DARPA hoped to stimulate new and different technology in the spirit of 
CYC’s large knowledge base. Issues explored included the technology for scaling up 
the size of knowledge bases; the systematic encoding of more knowledge (in the spirit 
of CYC’s motivation); and the application of large knowledge bases to interesting 
classes of military "crisis management" problems (supporting intelligence analysts or 
their automated agents in interpreting international events and providing a corporate 
memory about past international crises). The HPKB project was successful, but the 
applications showed that the limiting factor for success was the speed at which people 
could encode knowledge accurately into the knowledge bases. This is the same 
“knowledge acquisition bottleneck” that was described by one of this paper’s authors 
as long ago as 1977! 

DARPA launched a successor project. The RKF project has as its goal increasing the 
speed of knowledge engineering and perhaps automating part or all of the process. 
More importantly, RKF focuses on the building of tools and processes to make it pos- 
sible for subject matter experts to author their own very large knowledge bases, in- 
stead of requiring AI researchers to do the knowledge engineering. For example, a 
test problem used in RKF is for students in biology to be able to use the tools them- 
selves to encode the knowledge encapsulated in a standard university biology text- 
book. 

Despite these efforts, it is clear that the development and deployment of very large 
knowledge bases is not helping the engineering of semantic web assistants - making 
knowledge of a great variety of domains, across a great variety of applications, avail- 
able to a great variety of users. The design of large and correct knowledge bases, 
requires effort that simply cannot cope with the scope, complexity, and exponential 
growth speed of the WWW and modern information sources. 



8 Web Ontology Languages 

Until now, the approach to building the very large knowledge base that is needed has 
been mainly a centralized approach (as exemplified by CYC and EDR). Word Net 
uses a more decentralized approach. It is probably the case that a much more decen- 
tralized approach is needed to recruit the effort of a large number of people in helping 
to build the knowledge base(s) to support intelligent assistance for WWW applica- 
tions. 
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Such an approach has been emerging. In this approach, users of the WWW are given 
the languages and tools to encode conceptual content by themselves. Just as giving the 
users tools (based on HTML) to encode document format descriptions led to an im- 
mense and unexpected explosion of the number of web pages, we believe tools can be 
developed that will let users develop and use knowledge bases on the web. Current 
web languages, such as XML and RDF, are starting to provide the basis on which such 
new tools can be developed. Though it is early in the exploration of this approach, 
some users are beginning to have an ability to encode content and concept. In this 
way, XML and its later extensions have the potential to trigger an explosion of com- 
puter-encoded knowledge. XML itself has been widely and rapidly appreciated by the 
information technology industry. Yet XML is just the first simple step down a long 
and very important path. 



A critical starting place for this explosion of knowledge encoding is the creation of 
web languages for ontologies. In today’s networked world, ontologies are usually just 
taxonomies of terms linked by relationships to give logical order to a large set of con- 
cepts and objects. Because ontologies have a formal logical structure with which they 
index their world of symbols, they can be used as a framework for making inferences 
about these symbols. 



8.1 Commercial Ontologies 

Within the e-commerce world, ontologies are beginning to become important corpo- 
rate IT tools. What was in 1990 just an arcane topic in the AI research community is 
now a thriving industrial activity among portal companies and the corporate web sites 
of large companies. In each case, the need is to give logical organization to a large 
amount of information on web pages, and to aid search engines to produce more rele- 
vant and categorized results. 

The first, and still probably the largest, such ontology is the one that launched Yahoo 
as a business. Yahoo was first a personal project of two Stanford University graduate 
students, then later a business. One of Yahoo’s first technical employees had the title 
of “Chief Ontologist.” Other industry efforts include “vertical industry portals” which 
use ontologies that are specific to an industry (such as steel or chemicals). 



8.2 The Language SHOE 

In the AI community, the need to represent ontological information in a web- 
embedded form was realized a number of years ago with the advent of the language 
SHOE, developed at the University of Maryland [4], It extended HTML to allow 
users to specify class and subclass relations, properties of classes, and a limited set of 
inference rules. A tool set for manipulating and using ontologies on the web was also 
designed. SHOE was used in a number of academic efforts in the US and abroad, but 
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was not widely accepted, in part because the HTML basis of the language didn’t have 
quite the power needed for building web tools that could easily coexist within existing 
browsers. 

With the advent of XML and RDF, web languages could more naturally encode the 
predicates and relationships needed for ontologies. RDF-Schema, a proposed “web 
recommendation”, added some SHOE-like properties to RDF. The European lan- 
guage OIL, is also built on RDF to allow ontological design (using a description logic 
approach). In the USA, DARPA is supporting the design of the “DARPA Agent 
Markup Language” (DAML) and sponsors researchers to develop tools and applica- 
tions that would exhibit and demonstrate the power of semantics on the WWW. Re- 
cently, a unification of DAML and OIL has occurred, and the resulting DAML+OIL 
language is now being considered as the basis of a “web recommendation” for a web 
ontology language. (DAML+OIL uses a proper XML encoding, thus allowing it eas- 
ily to build on existing XML browsing tools, style sheets, etc. See 
http://www.daml.org/ for language details, tools, etc.) 



8.3 DAML+OIL 

The DAML effort and DAML+OIL language can be considered the first in a modern 
series of experiments with a more distributed and diverse approach to knowledge 
bases and knowledge engineering. As of June, 2001, a repository of DAML+OIL 
ontologies holds over 150 contributions (with a total of 3400 properties expressed 
about 17,500 classes). In addition, a new semantic web crawler 
(http://www.daml.org/crawler) has collected 2,450,000 DAML statements that de- 
scribe documents on the WWW, and the number is increasing constantly. We see that 
the idea of enabling many users to create and edit small individual knowledge bases is 
starting to create a new kind of very large knowledge base. Importantly, the distribu- 
tion of effort over many people will allow this knowledge base to become very large 
very quickly. 

The knowledge principle of course will still apply. An investigation of the ontologies 
in the DAML repository reveals that many are very incomplete (containing only a 
small amount of facts about a wide range of topics) and that there are inconsistencies 
between them. Thus, a term such as “person” appears in many different ontologies, 
but in some may contain no properties other than an identifier, while in others may 
contain a larger set of restricted features. The super classes of person include items 
like agent, entity, primate and thing while subclasses include employee, researcher, 
project-leader among others. Thus, a user coming to the semantic web, wishing to use 
it for a more complex reasoning task, will either need mechanisms for collecting and 
merging definitions, or will be unable to use it for any sophisticated problem solving. 
In short, the lack of engineered knowledge is felt, even though the quantity and variety 
of terms is growing at a rapid rate. 
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9 Bridging between Knowledge Bases, Large and Small 

Several research efforts, still very much in their infancy, seem to us to hold a very 
exciting direction for the future of the semantic web. These efforts focus on using 
large knowledge bases, such as Word Net and CYC, to provide the semantics for pro- 
viding mappings between smaller ontologies. In short, the large carefully engineered 
ontologies may hold the key to providing significant computational power in the use of 
the more distributed and localized ontologies. Small communities can develop their 
own ontologies and terms. The larger knowledge bases can be used to help (in auto- 
mated or semi-automated ways) in the discovery of common terms and the develop- 
ment of “articulation” ontologies [5]. Term “dictionaries” can provide a mapping, full 
or partial, between ontologies. For example, an articulation rule located on the web 
page http://www.cs.umd.edu/users/hendler/jhendler.daml asserts that the term "proj- 
ect" defined in an ontology called "atlas-emu" is EquivalentTo (i.e. has the same 
meaning as) the term "organization" located in a general ontology developed at the 
University of Maryland. 



10 Another Look at the Metaphor of Power 

We can now see a new way to think about “knowledge is power”. In the earlier Ex- 
pert System experiments, where reasoning was restricted to relatively small specialty 
domains, a single centralized knowledge base provided major leverage. In this “cen- 
tralized” sense, the power of the knowledge is analogous to that of a centralized elec- 
tric dynamo in a power plant. From the emerging semantic web experiments, we 
foresee that knowledge will be much more distributed, and in smaller packages. In the 
analogy, this is the power found in electrical batteries. Knowledge is power, but in 
small amounts of power. 



We foresee a unification of large and small — using distributed ontology creation and 
tools for the creation of a wide range of small efforts, coupled with a few larger efforts 
to provide large and powerful knowledge bases. In the analogy, the dynamos are 
Word Net, CYC, etc. The distribution is accomplished using tools for distribution, 
such as the DAML ontologies. The WWW provides the “wires” to connect. 

The analogy leads to a vision similar to the early days of the coming of electrical 
power, and the evolution of this into the current large-scale power grids. The meta- 
phor of “knowledge is power” thus changes from one of centralized power of the 
dynamo to the distributed power of the modern electrical grid, which can deliver elec- 
tricity into an extremely wide range of small things. These small things, since they 
can rely on a plentiful supply of widely accessible electricity, can have more power 
than could be provided simply by individual batteries, and the batteries themselves can 
be “recharged” when connected to the grid. 
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10.1 A Vision of the Knowledge-Connected Portable 

The idea of “recharging” is starting to be explored in several small experimental ef- 
forts. The PalmDAML application described at http://www.daml.org/PalmDAML is a 
good example. A tool was developed to download DAML ontologies and instances on 
to a PalmPilot™. When connected to the web, the user can choose from a large vari- 
ety of types of information to load. The complete set of ontologies, however, is far 
too large to fit on the small device. The user chooses what libraries to download based 
on current preferences. Demonstration examples include the genealogy of the British 
Royal family, the military term ontology used by the US Center for Army Lessons 
Learned, and several others. A Palm application can then be used to browse this on- 
tological information, which can be used when uncoupled from the WWW. 



This sort of application is still primitive compared to others currently envisioned. 
However, it shows the tremendous potential possible in the distribution of knowledge 
and its use in localized applications. Coupled with the term mapping and merging and 
the more sophisticated rule sets proposed for the larger WWW Knowledge Base ef- 
forts, we can envision the potential for the future semantic-web based “knowledge 
grid.” 



11 A Challenge for Twenty-First Century AI Research 

The Semantic Web, as we envision it evolving, will not primarily be comprised of 
carefully groomed ontologies that skilled knowledge engineers have constructed. In- 
stead, we envision a complex Web of semantics ruled by the same sort of anarchy that 
rules the rest of the Web. Instead of a few large, complex, consistent ontologies that 
great numbers of users share, we may see a world in which a great number of small 
ontological components exist, often consisting of pointers to other Web ontologies. 
Web users will develop these components in much the same way that Web content is 
created - via tools that make it easy to do layout and web design, while capturing 
knowledge in the background. 



The challenge to AI researchers is to create knowledge development tools that are 
usable by non-experts without a strong knowledge engineering background. These 
tools are a precondition for recruiting the millions of web page contributors into the 
effort to build the enormous Knowledge Base that will be needed to bring intelligence 
to the Web. With a spiral development of such tools, with content created automati- 
cally, and with techniques that use that knowledge to do new and exciting things for 
users of the Web, we may yet reach the long-range goal of Web intelligence. 
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Abstract. The advent of the internet has changed the world in possibly more 
significant ways than any other event in the history of humanity. Is internet 
access and use beyond the reach of ordinary people with ordinary intelli- 
gence? Ignoring for the moment economic issues of access for all citizenry, 
what is it about internet access and use that hinders more widespread accept- 
ability? We explore several issues, not exclusive, that attempt to provoke 
and poke at answers to these simple questions. Largely speculative, as in- 
vited talks ought to be, we explore 3 topics, well studied but as yet generally 
unsolved, in computational intelligence and explore their impact on web in- 
telligence. These topics are machine translation, machine learning, and user 
interface design. Conclusion will be mine; readers will draw general conclu- 
sions. 

Keywords: Computational Intelligence, Web Intelligence, Data Mining, 
User Interfaces, Natural Language Processing, Machine Translation, Ma- 
chine Learning. 



1. Introduction 

By the time you have completed reading this sentence you will have understood its 
meaning. Your achievement and success in understanding is most impressive. The 
speaker’s task is much simpler - to generate an utterance which conveys a presumably 
preexisting thought. Your task as listener is to decide what the speaker must have 
been thinking in order to motivate his utterance in the particular context in which he 
uttered it. In general, understanding a natural language (NL) is simply miraculous. 

NL represents an important modality for human computer interactions, from sim- 
ple NL interfaces to databases to machine translation to more general answer- 
extraction and question answering systems. Other important modalities, e.g., speech, 
pointing devices, graphical user interfaces, etc. remain. The perfection and integration 
of multimodal systems takes on new importance when we transpose previous solu- 
tions to the internet. Systems which can communicate in natural ways and can learn 
from interactions are key to long term success transferring computational to web in- 
telligence. 

Although this talk will be somewhat speculative, as invited talks ought to be, we 
consider three aspects, well studied but as yet generally unsolved, in computational 
intelligence and we explore the impact of these topics for web intelligence. These 



N. Zhong et al. (Eds.): WI 2001, LNAI 2198, pp. 30-42, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 



From Computational Intelligence to Web Intelligence 



31 



three topics include natural language processing (particularly machine translation), 
machine learning, and user interface design. 



2. Natural Language Interfaces 

In the 1990’ s Simon Fraser University researchers were engaged in a long-term proj- 
ect entitled “Assessing Information with Ordinary Language” which has found its re- 
alization in several versions of SystemX. Initial SystemX NL interface prototypes 
were modularly designed utilizing proven technologies, e.g., augmented transition 
network grammars. SystemX served as an umbrella project for new ideas and tech- 
nologies to be incorporated [1], as a testbed for various techniques espoused by 
graduate students [2] and for experimenting with nonstandard or incompletely speci- 
fied theories [3], 

Subsequent to these early experiments, the use of techniques pioneered in Sys- 
temX led to research into NL access to internet search engines. This effort was de- 
signed to make access to relevant information on the internet more easily accessible to 
ordinary users without the need for sophisticated knowledge of appropriate keyword 
selection or complicated Boolean expression evaluation. This research led to two in- 
teresting NL systems for internet information access [4,5], 

We briefly illustrate these systems with examples in order to track the progression 
of research and the directions of change imposed by the advent of the internet. 

2.1 SystemX 

At Roger’s Cablesystems Ltd., the vice president for customer service enters the fol- 
lowing into his computer terminal, “Give me the Western region outage log for June”. 
Within seconds SystemX presents him with a neatly formatted table (or graph) of the 
data retrieved from Rogers’ relational database. He could have said, “What’s the out- 
age log for the Western region for June?”, or “Tell me the June regional outage log 
for the West.” or “Find the Western outages for June.”, etc. SystemX can determine 
that whichever phrase he uses, he means the same thing. Such flexibility in parsing, 
applying the logical rules of grammar to determine meaning, is nontrivial. SystemX’ s 
parsing techniques are described in [6]. After parsing, SystemX reformulates the 
question in SQL (structured query language) and data is extracted for presentation 
from Roger’s large central database. 

The nontrivial problem described in the preceding paragraph is but one of a large 
number of very difficult problems of understanding NL by computer. Fortunately, a 
NL interface is simpler to comprehend. Although one ultimately encounters problems 
comparable to the unconstrained NL understanding situation, the domain of discourse, 
and thereby the context, is highly constrained by the database schema. General analy- 
sis of language phenomena and much of the ambiguity inherent in NL understanding 
is limited but complexities arise when building NL capabilities into database inter- 
faces. One quickly comes to realize that domain knowledge is required in order to in- 
terpret queries, in order to answer queries, and that modeling the user is important as 
well [7], 

We present an example of SystemX accepting an English query from Rogers’ vice 
president, translating the query into SQL, retrieving data from Rogers’ database and 
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displaying the data in the format (table or graphical trend) specified by the user in the 
query. SystemX permits the user to generate printed copy of the data. This function- 
ality is currently accessed by way of a menu interface, see Figure 1 . The user may ex- 
amine the SQL generated in order to assure himself of the validity of the response. 

SystemX is able to display responses to requests for trends in statistical data 
graphically. The user has the choice of inputting his trend request using English, using 
menus (in the case of “canned” trends) or using a combination of English and menu 
responses. Various input modalities are provided as a convenience to users. The 
“canned” trends display data that is predictably desired on a reasonably frequent ba- 
sis. They may be accessed for a minimum of keystrokes. The “canned” trends are 
those available through the first eight menu items in the Trend Menu in Figure 1. 

Specifying a request for a trend in English may become quite cumbersome if de- 
fault parameters (specifying timing and so forth) are not employed. The complex 
statements required are difficult to formulate and demand patience on the part of the 
users while waiting parsing. The system therefore allows the users to request ad-hoc 
trends using a combination of English and responses to menus. This combination of 
modality reduces the task of specifying a complex query into a set of simple tasks that 
are accomplished in sequence. The system accesses the database in order to be able to 
present tasks to users in as helpful a manner as possible. 



1 New query 

2 Display SQL 

3 Save Last Response 

4 Print Saved Reponses 

5 Display a Trend 

6 Print Last Trend 

7 Make a Comment 

8 Stop 

WHAT NEXT? > 5 
Display a Trend 



Trend Menu 

1 # of C/S Representatives 

2 Service Call Ratios 

3 System Reliability 

4 Subscribers per Employee 

5 Subscribers per Km of Plant 

6 Maintenance Calls 

7 Repair Calls per TSR FTE 

8 Repair Calls per TSR Hours 

9 Specify an Ad Hoc Trend 

10 Cancel Trend Request 
WHICH TREND? > 9 
Specify an Ad Hoc Trend 



Enter phrase describing data to be graphed. 
(Return to cancel) : the unscheduled outages 



Parsing Completed. 

Do you uish the current, 12 month trend for the West' 





Fig. 1. The SystemX main menu and a trend menu. 



2.2 NLAISE and MATISE 

Despite the many search engines available, searching for a relevant site remains diffi- 
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cult. One major reason for this difficulty is that search engines do not analyze queries 
semantically; in contrast, most search engines perform keyword matching. 

How can our use of NL semantics improve internet searching? There have been 
many applications of automated NL understanding such as speech understanding, in- 
formation retrieval, question-answering systems, machine translation etc. One com- 
mon application provides a NL “front-end”, which enables users to access database 
information without any need to know database structure or any query language, and 
with no need for query transformation to some other representation, [1]. Our NL 
“front-end” to Internet search engines, which allows users to utilize search engines 
without finding appropriate search terms, is presented in [4,5]. For a search for: “I 
want to book a flight ticket” or “Show me some sites on online reservation of flight 
tickets” or phrases like “online reservation of flight tickets”, these queries would yield 
the same search results. 

Figure 2 shows the representation of existing search engines compared with the 
NL “front-end” we provide. Using search engines, the user must study various options 
pro- 
vided and transform the query into a form suitable for the specific search engine. The 
NL “front-end” analyses the NL query of the user and transforms it into appropriate 
search terms. 



USER 



list of keywords 
or other 




EXISTING SEARCH ENGINES 



search 




SEARCH ENGINES WITH NATURAL LANGUAGE "FRONT-END" 



Fig. 2. Representation of Search Engines. 



NLAISE allows users to choose the search engine best suited for their search and 
enter the query in English. The NL query is analyzed both syntactically and semanti- 
cally in order to select the most appropriate keywords describing sought information. 
Keywords are interpreted to provide more meaningful search terms by using keyword 
synonyms in conjunction with Boolean operators supported by specific search en- 
gines. 

Figure 3 illustrates the architecture of NLAISE. The NL query, along with the 
choice of search engine, is pre-processed in order to transform the query into a form 
suitable for input to the parser. The parser, in turn, has a description of grammar rules 
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natural complex 




Fig. 3. Natural Language Access to Internet Search Engines (NLAISE). 

for capturing the constraints of English and a lexicon that contains the words permit- 
ted as input. The Head Driven Phrase Structure (HPSG) parser generates a complex 
feature structure representing the query, [3]. The semantic content of such a complex 
feature structure is extracted, interpreted and transformed into a form suitable for the 
search engine that was selected. Figure 4 shows the output from asking NLAISE to 
parse the phrase “I want to schedule a trip to Japan” and generate appropriate key- 
words for search engine examination. NLAISE was also requested to use Infoseek as 
the search engine. Inspection of the 1,473 web pages returned verified that 80% were 
relevant. Note the choice of keywords “Japan” and “travel” which indicates the level 
of sophistication of NLAISE’ s semantic interpretation of the original input phrase. 




Fig. 4. Inforseek output for “I want to schedule a trip to Japan” using NLAISE. 
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EMATISE extended NLAISE in 3 user-oriented ways: (1) whereas NLAISE was 
tied to a single “travel” domain, EMATISE greatly enhanced semantic interpretation 
to eliminate much ambiguity and toil over multiple domains; (2) EMATISE sent out 
term expanded queries to multiple search engines in parallel and reranked results re- 
turned from these search engines into a single relevant high precision list for the user; 
and (3) EMATISE’ s higher level of abstraction above conventional search services 
presented the user with a single, central and natural search interface with which to in- 
teract. 

Figure 5 shows the results of EMATISE after a simple translation of the sentence 
“I want to visit the homepage of IBM product review” into search engine neutral 
search terms, term expanded by the drivers for particular search engines. Figure 5 il- 
lustrates the results of this query after the aggregation engine assembles the results. 



3. Generate and Repair Machine Translation (GRMT) 

Imagine picking up the phone in Toronto, dialing your Japanese program co-chairman 
in Tokyo to explain several papers lost in the shuffle of email systems. You speak 
English and she speaks Japanese. Fortunately it is 2010 and the English you speak in 
Toronto is automatically translated into Japanese in the time it takes to transfer your 
words over the phone lines. Impossible, - probably not. The world of machine trans- 
lation has both fascinated and frustrated researchers for over 50 years. More recent 
successes in statistical, nonlinguistic and hybrid systems have given hope that we will 
not be confined to the traditional direct, transfer and intralingual approaches that have 
dominated in the past. An informative critique of these traditional approaches is given 
in [8]. We provide another approach following from computer science methodology: 
generate and repair machine translation. (GRMT). 




Fig. 5. EMATISE output -“I want to visit the homepage of ibm product review”. 
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GRMT (Figure 6) is composed of 3 phases: “Quick and Dirty MT”, “Translation 
Candidate Interpretation (TCI)” and “Repair and Iterate (RI)”. “QDMT” generates 
translation candidates (TC) by considering syntactic and semantic differences be- 
tween language pairs without performing any sophisticated analysis. This ensures that 
the TC can be generated quickly, simply and efficiently. Next, the system interprets 
the TC to see if it retains the meaning of the SL. If so, that TC will be considered a 
translation. If not, that TC will be repaired based on the diagnosis that is indicated in 
the second phase, TCI. Subsequently the repaired TC will be re-interpreted to deter- 
mine if it still has a different meaning from the SL. These two processes iterate until 
the TC conveys the same meaning as the SL. The TCI and RI stages ensure the accu- 
racy of the translation result. They also guarantee the accuracy of the translation back 
from the TL to SL. 



SL TL 




Fig. 6. GRMT Architecture 



GRMT treats SL and TL separately and is aware of the differences between lan- 
guages. Therefore, if languages can be grouped according to various characteristics, 
e.g., auxiliary verb, continuous tenses, passive voice, etc., which they have in com- 
mon, then the translation between groups can be performed more simply by GRMT. 
For example: Group 1 consists of English, French and Spanish, Group 2 consists of 
Chinese, Japanese and Thai. To perform the translation between these two groups, 
the transfer approach requires six SL analyzers, six TL generations and 18 sets of 
transfer rules while GRMT requires six SL TCIs, six TL TCIs and 2 sets of constraint 
applications. 

Table 1. Examples of Generated TC 



1 . The wheat was yellow. 

TC: <UE“« 7T— E7T 

CT: <EE“« 7T— E7I ’ftA-QOIS 

2. Five days before the trout are released. 

TC: AE“ «— n °16n £’E n—En ®- dY° a «EO- 

CT: AE“ «—n °EOn £’E n—En ®- dY° a «EO- 

3. You can take a fish to school but you can not make them think. 

TC: §y> “i‘Vd $6“ a ~“ fl — «A7T-fEJS %o a QY15 */iE §y> %oiE “i‘Vd £”„AE 
a3«°$<l5“ §‘¥ 

CT: §y> “i‘Vd $6“ a ~“ j u — «An+E£ %o a ,V r JS^V r, - | 7r */jE §y> %oiE “\“Vd 
P’„AE ae«°$<fr“ §‘¥ 

4. When I was an ugly duckling he thought I never dreamed I could be so 
happy. 

TC: fi(>E6 ©— n ^An =Y°f a A¥ [i—«An+E& §‘¥ ©—n %oiE#§- Q—n 

© — n “t“Vd ± a An §«“i y<t ^’EfA/E i“° 

CT:fi(>E6 © — n ^A n ~Y°f a A¥ ^’E^AV'E fj — «A7T-fEJS f §‘¥ © — n 

%oiE^§-' Q — n © — n “\“Vd P §«“i y<t i“° 
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Initial experiments of QDMT (English to Thai) indicate that TCs can be generated 
with relative accuracy. Table 1 shows example results. Generated TCs of Examples 1 
& 2 can be the translation results of each without the need of the RI stage because 
they are exactly the same as the correct translations(CT). There is one inappropriate 
word (underlined word) in the TC of Example 3. However, this inappropriately se- 
lected word can be diagnosed and replaced in TCI and RI stages. In example 4, There 
is one inappropriate selected word and one misordered word that can be simply re- 
paired. 



4. Machine Learning: Rule Induction and Classification 

Research into data classification and rule induction carried out at the University of 
Waterloo, has resulted in the development of the ELEM2 system. ELEM2 outper- 
forms existing rule induction methods and has many potential applications, including 
the data classification and rule induction application role for a molecular compound 
database used by GlaxoKleinSmith. For a detailed description of the ELEM2 system, 
see [9-11]. 

ELEM2 induces decision rules from a set of observed data and classifies new ex- 
amples by applying the induced rules [9]. ELEM2 is distinguished from other rule in- 
duction systems in that it employs several new strategies in the induction and classifi- 
cation processes. First, ELEM2 uses a new heuristic function for evaluating attribute- 
value pairs in the induction process. The function reflects the degree of relevance of 
an attribute -value pair to a target concept and leads to selection of the most relevant 
pairs for formulating rules. Second, ELEM2 handles inconsistent training examples 
by defining an unlearnable region of a concept based on the probability distribution of 
that concept in the training data. The unlearnable region is used as a stopping criterion 
for the concept learning process, which resolves conflicts without removing inconsis- 
tent examples before rule induction. Third, ELEM2 employs several methods for the 
discretization of continuous attributes for learning classification rules, thus providing 
a choice to the user of the best method. In addition to the traditional equi-distant in- 
terval and equi-density methods, ELEM2 provides a new measure based on informa- 
tion entropy [10]. Fourth, ELEM2 employs a new rule quality measure for the pur- 
pose of handling imperfect data by post-pruning generated rules [11]. The measure is 
defined according to the relative distribution of a rule with respect to positive and 
negative examples and is chosen from 4 alternatives that represent different kinds of 
distributions. 

All four strategies reported above provide ELEM2 with its unique character that 
has imbued ELEM2 with incremental improvement over competing systems and 
tested over a wide variety of datasets. We have tested ELEM2 using over 50 different 
datasets from the UCI Repository and with several commercial datasets provided by 
Knowledge- Junction Systems, GlaxoKleinSmith, etc. It is perhaps the rule quality 
measures that hold promise for the greatest overall performance enhancement. 

A rule induction system generates decision rules from a set of training data. The 
set of decision rules determines the performance of a classifier that exploits the rules 
to classify unseen objects, for example unknown words in text. It is therefore impor- 
tant for a rule induction system to generate decision rules that have high predictability 
or reliability. These properties are commonly measured by a function called rule 
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quality. A rule quality measure is needed in both the rule induction and classification 
processes. A rule induction process is usually considered as a search over a hypothe- 
sis space of possible rules for a decision rule that satisfies some criterion. The possi- 
ble rules, in this case, are those rules that are defined by a concept description lan- 
guage, such as prepositional rules. In the rule induction process that is based on gen- 
eral-to-specific search (such as CN2 [12]) a rule quality measure can be used as an 
evaluation heuristic to select attribute-value pairs in the rule specialization process; 
and/or it can be employed as a significance measure to stop further specialization. The 
main reason to focus special attention on the stopping criterion can be found in the 
studies on small disjunct problems [13]. The studies indicated that small disjuncts, 
which cover a small number of training examples, are much more error prone than 
large disjuncts that cover a large amount of training examples. To prevent small dis- 
juncts, a stopping criterion based on rule consistency (i.e., the rule is consistent with 
the training examples) is not suggested for use in rule induction. Other criteria, such 
as the G2 likelihood ratio statistic as used in CN2 and the degree of logical suffi- 
ciency have been proposed to pre-prune a rule to avoid overspecialization of the rule. 
Some rule induction systems, such as C4.5 [14] and ELEM2, use an alternative strat- 
egy to prevent the small disjunct problem. In these systems, the rule specialization 
process is allowed to run to completion (i.e., it forms a rule that is consistent with the 
training data or as nearly consistent as possible) and post-prunes overfitted rules by 
removing components that are deemed unreliable. Similar to pre-pruning, a criterion 
is needed in post-pruning to determine when to stop this generalization process. 

A rule quality measure is also needed in the classification process. It is possible 
that in the decision making process an unseen example satisfies multiple decision 
rules that are assigned to different classes. In this situation, some conflict resolution 
scheme must be applied to assign the unseen object to the most appropriate class. It is 
therefore useful for each rule to be associated with a numerical factor which can rep- 
resent its classification power, its reliability, etc. 



5. Brainstorms 

Successes mentioned in sections 2-4, and others like them, represent contemporary 
computational intelligence solutions. How do we adapt them to represent web intelli- 
gence solutions? We briefly describe current work designed to make useful solutions 
to computational intelligence problems amenable to web intelligence. Some of this 
work takes advantage of newer technologies already beginning to show up in web ap- 
plications (agent architectures, recommender systems, information extraction tools, 
etc.). This curren work represents an intermediate step along the way to web intelli- 
gence. It necessarily leads to the realization that more adaptable and more general 
machine learning strategies need developed and incorporated into every aspect of web 
intelligence. One glaring example would be learning the meaning of unknown or un- 
defined words, for machine translation and general speech and NL processing. 

5.1 Java Parsers, Just-in-Time Subgrammar Extraction, Modular HPSG’s 

Stefy is a NL parser implemented in Java, based on HPSGs [15], It is part of a larger 
project to implement a NL processing system for Internet information retrieval (IR). 
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This IR task requires Java applets capable of parsing a NL. Earlier we discussed work 
on developing HPSG parsers. However, Stefy is one of the first implemented in Java. 
Java was chosen for two reasons. Java supports dynamic class loading and object se- 
rialization, which are important features necessary for our concept of distributed NL 
processing. Java is a good prototyping language, compared to C++ for example, and 
facilitates easy experimentation with various approaches, which makes this shift in 
programming language paradigm less drastic. 

A drawback of our implementation is that it is not suitable for development of the 
grammar and lexical resources. Other systems, like ALE [19] and LKB [20], are more 
appropriate for this task. After a grammar or a lexicon is developed in one of those 
systems, it is translated into a Java description and used in Stefy. 

Stefy represents a new precise and compact description of the HPSG formalism, 
which is especially suitable for implementation of HPSG parsers in low-level lan- 
guages. Stefy represents an important step towards applying HPSG formalism in the 
area of distributed NLP and answer extraction. 

Stefy’ s approach is similar to the filtering techniques, which are a recognized way 
to improve parser's performance. However, Stefy is different because we insist that 
the filtered, i.e., extracted, knowledge is in the form of a grammar. This approach is 
theoretically sound, and in practice it provides a clean interface between subgrammar- 
extraction part and the parser. More arguments for this separation of the subgrammar 
extraction and parsing are given in [17]. 

An important part of the HPSG subgrammar extraction is the extraction of the cor- 
responding type sub-hierarchy out of the original hierarchy. Efficient type operations 
and representation of the types are used in approximate algorithm for subgrammar 
extraction for HPSGs. 

Recently, there has been a lot of research activity in the area of grammar modu- 
larity. Some of the motivational factors for this work are the following: 

• managing complexity. The NL grammars used in NL processing are large and complex. 

The difficult problems are designing, creating, testing, and maintaining them. Using 
smaller modules that are combined into larger grammars addresses the complexity 
problem. 

• parsing efficiency. Parsing with a large, wide-coverage grammar is typically not effi- 

cient. Quickly extracting a small subgrammar module, and then using it to parse the 
text can reduce the running-time and space requirements. 

• context-based disambiguation. By having a larger grammar we achieve a better cover- 

age, but in the same time it becomes susceptible to ambiguities. Any NL is very am- 
biguous, and it is well known that humans use world-knowledge and contextual 
knowledge to do disambiguation. Extracting a subgrammar based on the text to be 
processed can be viewed as creating a context that can improve disambiguation. 

5.2 Recommender Systems Using ELEM2 

Recommender systems suggest information sources, products, services, etc., to users 
based on learning from examples of their preferences, dislikes, etc. There are two 
predominant methodologies employed in such systems. Collaborative (social) filter- 
ing methods base recommendations on other users preferences, e.g., when you order 
books from Amazon.com, the recommender system may detect other customers who 
ordered the same books and determine other orders placed by these customers to then 
enquire whether you may also be interested in acquiring similar material. Content- 
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based methods use information about the item ordered/specified in order to make 
further suggestions to the user. Advantages of content-based methods include the 
ability to recommend previously unrelated items to users with unique interests and to 
also provide explanations for recommendations. 

For collaborative (social) filtering, we plan to merge information sources to permit 
more fine-grained analysis and subsequent recommendations. For example, use of the 
Statistics Canada database on wealth demographics in Canada, which they categorize 
from richer to poorer by postal code, could conceivably recommend products/services 
based not only on social preference but also by wealth demographics at the same time. 

We especially wish to develop content-based methods since this will provide a 
new application for ELEM2. Content-based recommender systems provides another 
unique application for embedded ELEM2. Briefly, a set of documents (web pages, 
newsgroup messages, etc.) would have information extracted from an information 
extraction (word extraction) phase to develop a set of examples. We randomly select a 
set of examples and choose a subset of these examples from which we determine from 
a user, positive and negative examples. These positive and negative examples serve as 
a training set for the user. We apply ELEM2 rule induction process to extract a 
“user’s profile” and then rank the rest of the examples accordingly. The top ranked 
examples then serve as a list of items for recommendation. 

5.3 Agents and Agent Architectures 

The internet is a large, distributed, and heterogeneous source of information primarily 
consisting of on-line World Wide Web documents. It is perceived through a set of ap- 
plications based on the point-to-point communication links provided by the TCP/IP 
protocol. Many applications frequently end up with the problem that we want to find a 
relevant document, relevant item, or, generally, a relevant point in the information 
space consisting of Telnet sites, news groups, news group postings, FTP (File Trans- 
fer Protocol) sites, and WWW documents (pages, movies, radio broadcasts). How can 
we find out if someone has an e-mail address and how can we find that address? 
Finding interesting mailing lists is a still better example. 

The internet can be imagined as a low-level structure activated with considerable 
manual (human) participation. Such an intelligence-assuming environment requires 
computational intelligence management techniques. The most obvious example is a 
simple Web page. If we want to automatically use its content in a fashion more so- 
phisticated than collecting keywords, or collecting embedded links for further naviga- 
tion, then the most flexible, robust, and appropriate way to do this is to understand 
some of its content and to reason about it. This is the realm of computational intelli- 
gence. 

“Agent” has become a computational intelligence term, and a frequent buzzword 
having a wide range of definitions. Nevertheless, there are some common characteris- 
tics that describe an agent. An agent is anything that can be viewed as perceiving its 
environment through sensors and acting upon that environment. Furthermore, the de- 
velopment of multi-agent systems is based on work in two areas artificial intelligence 
and distributed systems. 

The combination of NL processing and multi agent system's is still quite novel 
and often the terms are used independently. Consider the use of NL processing for in- 
formation retrieval (IR) over the internet. This is an attempt to match the meaning of 
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the user’s query to the meaning of retrieved documents. Since this approach relies on 
higher levels of NL processing, it is difficult to implement. Issues include deciding 
what is a concept, how to extract concepts from NL texts, and how to do concept 
matching. The inefficiency of existing NL processing systems is a major obstacle to 
using them in IR. If we want to use an NLP system to analyze the documents in a 
large document collection, it has to be efficient and robust to be useful in practice. 

A positive approach is to implement distributed NL processing so that the proc- 
essing cost is widely distributed in the same way as are internet resources. Multi agent 
systems are appropriate for this task, see [16]. 



6. Concluding Remarks 

Web intelligence requires further research and development into the technologies dis- 
cussed above and other technologies as well. Adapting existing computational intelli- 
gence solutions may not always be appropriate for web intelligence for a number of 
reasons, e.g., the magnitude of information available on the internet and the additional 
requirements for speedy processing. Computational intelligence solutions which may 
be adapted must incorporate a more robust notion of learning in order for these solu- 
tions to scale to the web, in order for these solutions to adapt to individual user re- 
quirements, and in order for these solutions to personalize interfaces. 

We have only briefly touched on a few, albeit important, issues that will be the 
mainstay of web intelligence in the near term future. Users will demand access to the 
internet that is simple (multimodal interfaces), with language/speech capabilities - 
both comprehension and, when needed, translation - and personalized (multi agent 
architectures) internet use which “learns”. 

How soon might we expect to see breakthroughs? One way of considering this 
question is to recognize that research progress is highly incremental, thus, we are 
seeing progress every day. I, for one, have great hopes for the future of web intelli- 
gence. 
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Abstract. This presentation will discuss techniques for incorporating 
“guidebots,” or animated pedagogical agents, into Web-based learning 
environments. Guidebots help keep learning on track; they offer students 
advice and guidance as appropriate in order to get the most out on-line 
learning experiences. Guidebots build on research in intelligent tutoring 
systems, but go further by engaging the learner in natural face-to-face 
interaction. Guidebots can stand on the side and discuss learning ob- 
jectives with the learners; they also can work together with learners as 
teammates, and can play roles within interactive educational stories. 
They help bring the aesthetics of animated entertainment to interactive 
educational experiences. This talk will present recent developments in 
guidebot technology, and outline challenges for current research. 
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Abstract. Ontological engineering as a key technology of the next generation 
knowledge processing is discussed. After a brief introduction to ontological 
engineering with my speculation about its potential contribution, three major 
results of the practice of ontological engineering in my lab are presented. Then, 
paradigm shift in information processing is discussed followed by a future 
directions in the Web intelligence context. 



1 Introduction 

In AI research history, we can identify two types of research. One is "Form-oriented 
research" and the other is "Content-oriented research". The former investigates formal 
topics like logic, knowledge representation, search, etc. and the latter content of 
knowledge. Apparently, the former has dominated AI research to date. Recently, 
however, "Content-oriented research" has attracted considerable attention because a 
lot of real-world problems to solve such as knowledge sharing, facilitation of agent 
communication, meta-data, semantic web, large-scale knowledge bases, etc. require 
not only advanced formalisms but also sophisticated treatment of the content of 
knowledge before it is put into a formalism. 

Formal theories such as predicate logic provide us with a powerful tool to 
guarantee sound reasoning and thinking. It even enables us to discuss the limit of our 
reasoning in a principled way. However, it cannot answer any of the questions such as 
what knowledge we should prepare for solving the problems given, how to scale up 
the knowledge bases, how to reuse and share the knowledge, how to manage 
knowledge and so on. In other words, we cannot say it has provided us with 
something valuable to solve real-world problems. 

In expert system community, the knowledge principle [Feigenbaum, 77] proposed 
by Feigenbaum has been accepted and a lot of development has been carried out with 
a deep appreciation of the principle, since it is to the point in the sense that he stressed 
the importance of accumulation of knowledge rather than formal reasoning or logic. 
This has been proved by the success of the expert system development and a lot of 
research activities have been done under the flag of "knowledge engineering’’. 
However, the author is not claiming the so-called rule-base technology is what we 
need for future knowledge processing. Rather, in order to adapt to the rapid change of 
the situation, treatment of knowledge should be in-depth analyzed. Advanced 
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knowledge processing technology should cope with various knowledge sources and 
elicit, transform, organize, and translate knowledge to enable the agents to utilize it. 

Although importance of such ''Content-oriented research" has been gradually 
recognized these days, we do not have sophisticated methodologies for content- 
oriented research yet. In spite of much effort devoted to such research, major results 
were only development of KBs. We could identify the reasons for this as follows: 

a) It tends to be ad-hoc, and 

b) It does not have a methodology which enables knowledge to accumulate. 

It is necessary to overcome these difficulties in order to establish the content- 
oriented research or content technology. Ontological Engineering has been proposed 
for that purpose. It is a research methodology which gives us design rationale of a 
knowledge base, kernel conceptualization of the world of interest, semantic 
constraints of concepts together with sophisticated theories and technologies enabling 
accumulation of knowledge which is dispensable for knowledge processing in the real 
world. The author believes knowledge management essentially needs content-oriented 
research. It should be more than information retrieval. The content technology should 
be more sophisticated and powerful to realize the true knowledge management. 

The objective of this paper is to discuss how ontological engineering[Mizoguchi 
97] has emerged and how it will contribute to the future knowledge processing 
together with a brief history of the author’s research activities on those topics. 



2 What Is an Ontology and What Is Ontological Engineering? 

Ontological engineering is a successor of knowledge engineering which has been 
considered as a technology for building knowledge-intensive systems. Although 
knowledge engineering has contributed to eliciting expertise, organizing it into a 
computational structure, and building knowledge bases, AI researchers have noticed 
the necessity of a more robust and theoretically sound engineering which enables 
knowledge sharing/reuse and formulation of the problem solving process itself. 
Knowledge engineering technology has thus developed into “ontological engineering” 
where “ontology” is the key concept to investigate. Roughly speaking, ontology 
consists of task ontologyfMizoguchi 95a] which characterizes the computational 
architecture of a knowledge-based system which performs a task and domain 
ontology which characterizes the domain knowledge where the task is performed. By 
a task, we mean a problem solving process like diagnosis, monitoring, scheduling, 
design, and so on. The idea of task ontology which serves as a theory of 
vocabulary/concepts used as building blocks for knowledge-based systems might 
provide us with an effective methodology and vocabulary for both analyzing and 
synthesizing knowledge-based systems. 

An ontology is understood to serve as a kernel theory and building blocks for 
content-oriented research. Definitions of an ontology are presented below: 

a) In philosophy, it means theory of existence. It tries to explain what exists in 
the world and how the world is configured by introducing a system of critical 
categories to account for things and their intrinsic relations. 

b) From an AI point of view, an ontology is defined as “explicit specification of 
conceptualization” [Gruber ]. 
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c) From a knowledge-based systems point of view, it is defined as “a theory 
(system) of concepts/vocabulary used as building blocks of information 
processing systems” [Mizoguchi 95a]. 

d) Another definition [Gruber]: Ontologies are agreements about shared 
conceptualizations. Shared conceptualizations include conceptual frameworks 
for modeling domain knowledge; content-specific protocols for communic- 
ation among inter-operating agents; and agreements about the representation 
of particular domain theories. In the knowledge sharing context, ontologies 
are specified in the form of definitions of representational vocabulary. 

e) A compositional definition: An ontology consists of concepts with 
definitions, hierarchical organization of them, relations among them (more 
than is-a and part-of), and axioms to formalize the definitions and relations. 

f) Yet another definition: An ontology is an explicit specification of objects and 
relations in the target world intended to share in a community and to use for 
building a model of the target world. 

Why ontology instead of knowledge? Knowledge is domain-dependent, and hence 
knowledge engineering which directly investigates such knowledge has been 
suffering from rather serious difficulties, such as domain-specificity and diversity. 
Further, much of the knowledge dealt with in expert systems has been heuristics that 
domain experts have, which makes knowledge manipulation more difficult. However, 
in ontological engineering, we investigate knowledge in terms of its origin and 
elements from which knowledge is constructed. An ontology reflects what exists out 
there in the world of interest or represents what we should think exists there. An 
ontology is essentially designed to be objective and shared by many people. 
Hierarchical structure of concepts and decomposability of knowledge enable us to 
identify portions of concepts sharable among people. Exploitation of such 
characteristics makes it possible to avoid the difficulties knowledge engineering has 
faced with. The following is a list of the merits we can enjoy from an ontology: 

a) A common vocabulary. The description of the target world needs a vocabulary 
agreed among people involved. 

b) Explication of what has been often left implicit. Any knowledge base built is 
based on a conceptualization possessed by the builder and is usually implicit. 
An ontology is an explication of the very implicit knowledge. Such an explicit 
representation of assumptions and conceptualization is more than a simple 
explication. Its contribution to knowledge reuse and sharing is more than 
expectation considering that the implicitness has been one of the crucial 
causes of preventing knowledge sharing and reuse. 

c) Systematization of knowledge. Knowledge systematization requires well- 
established vocabulary/concepts in terms of which people describe 
phenomena, theories and target things under consideration. An ontology thus 
contributes to providing a backbone for the systematization of knowledge. 

d) Standardization. The common vocabulary and knowledge systematization 
bring us more or less standardized terms/concepts. Standardization has to be 
taken not as restriction of free exploration of research mind but as a minimum 
set of shared terms/concepts among human and computer agents who can 
communicate with each other thanks to them. 
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e) Meta-model functionality. A model is usually built in the computer as an 
abstraction of the real target. And, an ontology provides us with concepts and 
relations among them which are used as building blocks of the model. Thus, 
an ontology specifies the models to build by giving guidelines and constraints 
which should be satisfied. This function is viewed as that at the metalevel. 
This functionality suggests us the possibility of an “ontology-aware” 
authoring tool which can be very intelligent in the sense that it knows what 
model it is going to help authors build. 



3 Some Experiences in Ontological Engineering 

3.1 Functional Ontology and Knowledge Systematization 

The first topic is on systematization functional knowledge in computer-aided 
design(CAD)[Mizoguchi 00a]. Knowledge systematization is indeed a topic of 
content-oriented research and is not that of a knowledge representation such as 
production rule, frame or semantic network. Although knowledge representation tells 
us how to represent knowledge, it is not enough for our purpose, since what is 
necessary is something we need before the stage of knowledge representation, that is, 
knowledge organized in an appropriate structure with appropriate vocabulary. This is 
what the next generation knowledge base building needs, since it should be principled 
in the sense that it is based on well-structured vocabulary with an explicit 
conceptualization of the assumptions. This nicely suggests ontological engineering is 
promising for the purpose of our enterprise. 

While any scientific activity which has been done to date is, of course, a kind of 
knowledge systematization, it has been mainly done in terms of analytical formulae 
with analytical/quantitative treatment. As a default, the systematization is intended for 
human interpretation. Our knowledge systematization adopts another way, that is, 
ontological engineering to enable people to build knowledge bases on the computer as 
a result of knowledge systematization. The philosophy behind our enterprise is that 
ontological engineering provides us with the basis on which we can build knowledge 
and with computer-interpretable vocabulary in terms of which we can describe 
knowledge systematically in a computer-understandable manner. 

By building a framework for knowledge systematization using ontological 
engineering, we mean identifying a set of backbone concepts with machine 
understandable description in terms of which we can describe and organize design 
knowledge for use across multiple domains. The system of concepts is organized as 
layered ontologies as is seen in Figure 1 . 

3.1.1 Functional Modeling 

No one would disagree that the concept of function should be treated as a first class 
category in design knowledge organization. That is, function is an important member 
of a top-level ontology of design world. One of the key claims of our knowledge 
systematization is that the concept of function should be defined independently of an 
object that can possess it and of its realization method. If functions are defined 
depending on objects and their realization, few functions are reused in different 
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domains. As is well understood, innovative design can be facilitated by flexible 
application of knowledge or ideas across domains. 

Functional representation has been extensively investigated to date [Sasajima 95] 
[Chandrasekaran 00] and a lot of functional representation languages are proposed 
with sample descriptions of functions of devices. However, because it is not well 
understood how to organize functional knowledge in what principle in terms of what 
concepts, most of the representation are ad-hoc and lack generality and consistency, 
which prevents knowledge from being shared. One of the major causes of the lack of 
consistency is the difference between the ways of how to capture the target world. For 
example, let us take the function of a super heater of a power plant, to heat steam and 
that of cam of cam&shaft pair, to push up the shaft. The former is concerned with 
something that comes in and goes out of the device but the latter with the other device 
that cannot be either input or output of the device. This clearly shows the fact that 
there is a difference in how to view a function according to the domain. The 
difference will be one of the causes of inconsistency in functional representation and 
non-interoperability of the knowledge when functional knowledge from different 
domains is put into a knowledge base. 

The above observation shows that we need a framework which provides us with a 
viewpoint to guide the modeling process of artifacts as well as primitive concepts in 
terms of which functional knowledge is described in order to come up with consistent 
and sharable knowledge. 

3.1.2 Hierarchy of Functional Knowledge and Ontology 

Figure 1 shows a hierarchy of functional knowledge built on top of fundamental 
ontologies. The lower layer knowledge is in, the more basic. Basically, knowledge in 
a certain layer is described in terms of the concepts in the lower layer. Top-level 
ontology defines and provides very basic concepts such as time, state, process and so 
on. Causal ontology specifies actions and causality against teleology. Physical world 
ontology specifies 3D space and entity to give axiomatic physical world with a state- 
based modeling reflecting a special world of design in which an entity(artifact) is 
created from nothing. These two ontologies contribute to “Symbol grounding’’ of 
higher-level concepts, that is, functional concepts. On top of these three, process 
ontology is introduced to specify natural processes or phenomena. Every device 
utilizes several natural phenomena to realize its functions. 

Device ontology imposes a frame or viewpoint on an event to introduce a more 
engineering perspective. That is, it introduces the concepts of a black box equipped 
with input and output ports. Device ontology defines fundamental roles such as agent, 
object, conduit and medium. Although process ontology is more fundamental than 
device ontology, there are some cases where process ontology is directly employed to 
model real world events/phenomena instead of device ontology. Typical cases are 
found in modelling chemical processes for which device ontology is not appropriate. 
In summary, five ontologies(Top-level, causality, physical world, process and device 
ontologies) collectively work as a substrate on which we can build consistent 
knowledge in layers. 

Functional concept ontology specifies functional concepts as an instance of 
function defined in device ontology. The definitions are scarcely depends on the 
device, the domain or the way of its implementation so that they are very general 
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and usable in a wide range of areas. Theories and principles of physics and abstract 
part library also belong to this class of knowledge called general concept layer. 
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Fig. 1 . Hierarchy of ontology and knowledge. 



Functional achievement way knowledge is such knowledge that represents various 
ways of achievement of a function. This knowledge is about how{ in what way) a 
function is achieved, whereas the functional concept is about what the function is 
going to achieve. In other words, the former is formulated in terms of whole-part 
relation and the latter in terms of is-a relation. Although functional achievement way 
knowledge looks similar to functional decomposition like that discussed in [Pahl 88], 
the former is much richer than the latter in that it consists of four kinds of hierarchies 
of different roles and principles(7.v-« hierarchy, attribute tree, functional 
decomposition tree and general functional decomposition tree). The inherent structure 
of such knowledge is organized in an is-a hierarchy from which the other three 
structures are derived according to the requirement. The is-a structure is carefully 
designed identifying inherent property of each way to make it sharable and applicable 
across domains. One of the key issues in knowledge organization is clear and 
consistent differentiation of is-a relation from other relations such as part-of, is- 
achieved-by, etc. keeping what is the inherent property of the target thing in mind. 

3.1.3 Roles and Effects of Functional Ontology 

The extended device ontology views an artifact as something that inputs, process and 
outputs objects. The object is something processed by the device during it goes 
through a device and hence it never be another device that cannot go through a 
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device. This ontology imposes a proper viewpoint from which one can successfully 
model a mechanical system in a way consistent with those models of engineering 
artifacts produced in other domains. It is not an easy task to build models of a lot of 
artifacts in a consistent way. “A gear pair changes torque”, “A cam shrinks a spring” 
and “A cam pushes up a rod” are inconsistent with each other in the hidden 
computational models. While the first one is based on the extended device ontology, 
the latter two are based on a different ontology, say, inter-device operation ontology. 
The organization of knowledge including these models will lose consistency. 

The extended device ontology allows us to build interoperable models and 
provides us with a guideline for modelling process by its role-assignment 
functionality which is the very source of consistency in functional knowledge 
organization. For example, the concept of a conduit helps us consistently recognize 
devices by taking it as the boundary between the devices. In the mechanical system 
domain, a shaft and a wire, which play the role of conduit in the mechanism level, 
enable us to identify each mechanism composed of mechanical elements. Models 
designed based on the extended device ontology has a high composability thanks to 
its localized description, that is, its independence of neighboring devices that are 
connected to each other only through attributes of an object. On the contrary, 
composability of inter-device operation ontology is low due to its high dependence on 
neighboring devices. 

3.1.4 Use of Functional Concept Ontology 

Functional concept ontology provides us with necessary and sufficient operational 
terms used for representing functional knowledge/model together with constraints to 
be satisfied by them. The following is the list of our work on use of the ontology 
through to evaluate it. All the activities are of new type and different from the 
conventional knowledge base technology. No problem solving knowledge is treated. 
Instead, objective and fundamental knowledge is analyzed and modeled to enable a 
knowledge-based system to articulate the domain and hence to in-depth understand 
the fundamental knowledge to provide useful knowledge with designers. This is what 
we need intelligent functional knowledge management in CAD community. 

a) Explanation generation at the functional levelfSasajima 95]. 

b) Functional model description of specific artifacts of many kinds [Kitamura 99a] 

c) Description of ways of functional achievement: 104 ways for 26 functions 
found in five different artifacts(a washing machine, a printing device, slicing 
machines for ingot of semiconductors (using wire or rotating blade), and an 
etching device) 

d) Specification of the inference space for functional reasoning[Kitamura 99b,00], 

e) Way knowledge server for designer support. 

3.2 Plant Ontology for Multi-agent Plant Operator Support System 

This section describes an activity of ontology construction and its deployment in Oil- 
refinery plant which has been done under the umbrella of Human-Media Project for 
five years, which is a MITI(Japanese Ministry of International Trade and Industries) 
funded national project, is intended to invent an innovative media technology for 
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happier human life in the coming information society. Our ontology construction 
activities have been done in the project named “Development of a human interface for 
the next generation plant operation” running as a subproject of Human Media 
project[Mizoguchi 99, 00b]. 

The interface for oil-refinery plant operation has been developed intended to 
establish a sophisticated technology for advanced interface for plant operators and 
consists of Interface agent: IA, Virtual plant agent: VPA, Semantic information 
presentation agent: SIA, Ontology server: OS and Distributed collaboration 
infrastructure: DCI. The last two are mainly for issues related to system building, 
while the first three are related directly to interface issues. OS has been developed 
employing ontological engineering. 

3.2.1 The Role of a Plant Ontology 

Any intelligent system needs a considerable amount of domain knowledge to be 
useful in a domain. The amount of knowledge necessary often goes large, which 
sometimes causes difficulties in the initial construction and maintenance phases. As 
described above, one of the methods we adopted to cope with such problems is 
ontological engineering. Roughly speaking, the essential contribution of the plant 
ontology is making shared commitment to the target plant explicit, and hence 
terminology is standardized within the community of agents. By agents, we also mean 
human agents, operators, to share such a fundamental understanding about the plant. 
This enables the system to communicate with operators using the terms stored in 
Ontology server: OS. It is the second major role of OS in the current implementation 
of the interface system which is discussed below. 

In message generation, we need to pay maximal attention to word selection to 
make operators’ cognitive load minimum in message understanding. After an 
intensive interview with domain experts, we found human operators use different 
terms to denote the same thing depending on context. When we first noticed this fact, 
domain experts apologized for this seemingly random fluctuation of word usage, 
since they did not know the reason why they use terms that way and they were used to 
collaboration with computer engineers who do not like neat adaptation and tend to 
compel their idea of “this is what a computer can do, so accept it”. They kindly 
declared that they would soon determine a unique label for each thing. But, we were 
different from such computer engineers. Instead of accepting their proposal, we 
carefully analyzed the way of their word usage and finally came up with that it is not 
random except a few cases. Many of the wording have good justifications which have 
to be taken care of in the message generation. 

The reasons why we employed distributed collaboration architecture with multiple 
agents include making the whole system robust and easy to maintain. As is well 
known, however, these merits are not free. We need a well-designed vocabulary for 
describing message content as well as a powerful negotiation protocol. Although the 
latter is of importance, it is out of the scope of this article. DCI is responsible for 
enabling collaborative problem solving by multiple agents with the help of OS. It is 
one of the key factors that domain-dependent knowledge be isolated in OS so that 
DCI can be as general as possible. 
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3.2.2 Plant Ontology 

The plant ontology we built consists of several hierarchical organizations of concepts 
such as operation task, plant components, plant objects, basic attributes and ordinary 
attribute. Because of the space limitation, only domain ontology is discussed here. 
The key issue in the design of an ontology is clear distinction essential categories 
from view-dependent concepts. 

There exist two major things in the plant domain: Plant components(devices) and 
plant objects to be processed by the devices. Domain concepts also have role concepts 
like task ontology does. To say precisely, many of the domain concepts are role 
concepts. The first things we have to do when designing a domain ontology is 
discrimination of roles concepts from essential categories (or basic concepts), i.e., 
view- or context-independent concepts. Let us first take plant object. The top-level 
categories of plant object are view-independent object and view-dependent object. 
The former includes LP gas, gasoline, naphtha, etc. which are categories persistent in 
any situation. The latter includes tower-head ingredient, liquid, distillate, input, 
intermediate product, raw material, fuel, etc. All are view- or context-dependent. 
The major task needed was categorization of such dependency. Figure 2 shows a 
portion of plant object is-a hierarchy. The major categories of view-dependent plant 
object are state-dependent, location-dependent, history-dependent and role- 
dependent objects, state-dependent objects has inherent state-dependent and relative 




Fig. 2. A Portion of Plant Object is-a Hierarchy in the Plant Domain Ontology 

state-dependent objects as its sub-concepts. The former includes liquid, gas, 
superheating steam etc. and the latter low temperature ingredient, low boiling point 
ingredient, etc. 
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Attribute also needs careful treatment. Most of the attributes people think so are 
not true attribute but role attribute. Let us take an example of height. It is a role 
attribute whose basic attribute is length. Height, depth, width and distance are role 
attributes. Just like a man is called a husband when he has got married. The true 
attribute is called basic attribute. Examples of basic attribute include length, area, 
mass, temperature, pressure, volt, etc. Role attribute includes height, depth, input 
pressure, maximum weight, area of cross section, etc. Needless to say, these attributes 
are also decomposed into several sub-concepts. 

We finally built an ontology which contains about 400 concepts which are 
approved by the domain experts and the coverage is around the normal pressure 
fractionator of a full-scale refinery plant. The model of the target refinery plant is 
built by instantiating the appropriate concepts in the ontology and connect them. The 
number of instances generated is about 2000. The ontology and the model of the 
target plant is stored in OS and served to other agents in the total prototype system. 
Evaluation was done by experts and we have got a favorable result. 

3.3 Hozo: An Integrated Ontology Development/Use Environment 

Building an ontology requires a clear understanding of what can be concepts with 
what relations to others. An ontology thus focuses on “concepts” themselves rather 
than “representation” of them. Although several systems for building ontologies have 
been developed to date, they were not based on adequate consideration of an 
ontological theory which is why most of them are yet another KR languages. We 
believe that a fundamental consideration of ontological theories is needed to develop 
an environment for developing ontologies. We discuss mainly “role concept” and 
“relationship”, and consider how these ontologically important concepts should be 
treated in our environment. On the basis of the consideration we have designed and 
have developed an environment for building and using ontologies, named “Hozo”. 

“Hozo”[Kozaki 00] is composed of “Ontology Editor”, “Onto-Studio” and 
“Ontology Server”(See Figure 3). The Ontology Editor provides users with a 
graphical interface, through which they can browse, build and modify ontologies by 
simple mouse operations. This system manages attributes between concepts organized 
in an is-a hierarchy. The Onto-Studio is based on a method of building ontologies, 
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named AFM (Activity-First Method) [Mizoguchi 95b], and it helps users design an 
ontology from technical documents. The Ontology Server manages the ontologies and 
models. Ontology Editor in Hozo have been extensively used in my lab for four years 
and it was used to develop the plant ontology described above together with Ontology 
Server. 

Because FIozo is implemented in Java and the ontology editor is an applet, it can 
work as a client through Internet. Hozo manages ontologies and models for each 
developer. For each ontology in Hozo, only its author can modify it, and the other 
users can only read and copy those developed by others. It lets share ontologies 
among users without explicit version control. 

Models are built by choosing and instantiating concepts in the ontology and by 
connecting the instances. Hozo also checks the consistency of the model using the 
axioms defined in the ontology. The ontology and the resulting model are available in 
different formats (Lisp, Text, XML/DTD) that make them portable and reusable. The 
axiom contains constraints which part-concepts or attributes should satisfy, and 
relations among the part-concepts. For example, these are constraints on the part- 
concept such as “any teachers must have a teaching certificate” in a “school”, and 
“the size of wheels are from 10 inch to 30 inch” in a “bicycle”. Another example is a 
constraint on the relation such as there must be a connection relation between a wheel 
and a frame in a “bicycle”. Figure 4 depicts a snapshot of plant ontology definition 
using Ontology editor where Flow controller as a subclass of controller is defined. A 
string above a blank rectangle represents a role name of an instance which should be 
put in the rectangle followed by colon and an dark rectangle which is a class 
constraints of the instance. 




Fig. 4. A snapshot of the plant ontology definition. 
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4 Paradigm Shift 

At the expert system time, people’s expectation was to build a stand-alone problem 
solving system which has a knowledge base of a domain expert’s heuristics to 
perform a specific task with similar or higher performance than that of the expert. 
That is, the main focus was put on dealing with subjective and specific knowledge for 
problem solving. However, situation has been changing. Most of the salient activities 
such as Web document search in the internet, Electronic commerce(EC), Electronic 
Data Interchange(EDI), Knowledge management(KM), STEP, XML tag design, etc. 
do require almost opposite characteristics to knowledge, that is, objective, general and 
sharable knowledge which is not necessarily tuned to problem solving. 

We can summarize the major trends in the following four kinds of paradigm shifts 
in computer technology: 

a) From Computer-centered to Human-centered 

b) From Processing-centered to Information-centered 

c) From Form-oriented to Content-oriented 

d) From Centralized control to Distributed control 

The first is based on deep reflection on the long history of computer-centered 
research which have never been good in human-computer interaction aspects, since 
such technology forces a human to approach machines/sy stems or allow, at best, 
addition of an ad-hoc interface on top of each system. We need to change 
paradigmatically to come up with an innovative and essentially better human interface 
technology. The concepts of Human-centered and Information-centered technologies 
are key concepts of such an enterprise. The true man-machine systems which are what 
we need in the coming information age do require an open architecture involving 
humans who need computers to help them facilitate their daily activities. 

The second reflects what we learned from expert system development in which 
processing-centered approach has been dominant. That is, an expert system tries to 
solve a problem instead a human. The problem was it is not what people really need. 
What people need is an intelligent life-long partner who helps them in many aspects 
to amplify their capability. 

The third is a topic related to artificial intelligence(AI) research where so-called 
form-oriented basic research has been extensively conducted. It has been trivial from 
the beginning that no intelligent system can function without a reasonable amount of 
knowledge. Nevertheless, form-oriented research has dominated AI research. 
Content-related activities are mainly knowledge base construction. Although huge 
amount of such activities have been conducted to date, they are “development” rather 
than “Research”, since they are ad hoc, heavily domain-specific and hardly 
accumulatable. We need content-oriented “Research” to make an essential 
contribution to intelligent system building. 

The fourth is related to system architecture issues. It is an infrastructure of building 
a large-scale robust system which is often difficult to build and maintain. Typical 
distributed control systems include a multi-agent system in which agents collaborate 
with each other without a priori specification of interaction between them unlike the 
conventional centralized control systems. This paradigm of system design makes it 
easier to build a large-scale system provided a powerful negotiation protocol. 
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Web Intelligence, or Web-based intelligent systems should be something like a 
partner of a human by being compliant with the above paradigm shift. 



5 Web Intelligence and Ontological Engineering 

When we accept the paradigm shift, WWW technology is going to bring us a kind of 
revolution to a knowledge base building. Conventionally, a knowledge base has been 
something to design and build upon request. However, WWW and semantic web 
technologies facilitate automatic building of knowledge resources so that a huge 
knowledge base virtually exists out there, and hence the problem to solve has become 
not to build a knowledge base from scratch but to collect appropriate web pages out of 
already existing WWW knowledge resources, to reorganize and to merge them. 
Enabling technologies are XML, RDF(S) and DAML+Oil in Semantic Web[SW], 
Semantic web has been devised to make web pages machine interpretable and 
hence to change the WWW from flood of irrelevant information to a huge useful 
knowledge source. The goal is good. The problem, however, is how we can make use 
of the web pages retrieved which still include irrelevant pages and need more 
elaboration for use of specific purposes. Although ontology will be extensively used 
in Semantic web activities, the major use is limited to exploiting super-sub relation 
between concepts for the purpose of the intelligent retrieval of relevant web pages. It 
is true that retrieval is one of the key technologies in Semantic web. We look further 
ahead and envision critical contribution of ontological engineering to the next 
generation knowledge processing. That is, systematic development and sophisticated 
processing of semantic tags which will definitely overflow all over the world and will 
need sophisticated ontologies that are something more than a hierarchical 
organization of concepts to process them appropriately. 



6 Conclusions 

We have discussed the ontological engineering, its successful applications and future 
directions to go mostly from the application-oriented viewpoint. Academic 
perspectives of ontological engineering include fundamental theories and common 
top-level ontology. Both perspectives suggest ontological engineering will play a 
critical role as content technology for the next generation knowledge processing. 
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Abstract. The subject of this talk is the use of ideas from social network 
theory on the web and in the enterprise. We begin by reviewing a number 
of empirical observations on the web, concerning various measures of 
popularity of websites. Next, we describe how these observations can 
be used in algorithms for searching and mining on the web. We develop 
mathematical models for these phenomena. Finally, we discuss how these 
ideas and phenomena change as one goes from the public web to the 
confines of enterprises. 



1 Overview 

The internet, and its particular manifestation as the World-Wide Web, have led 
to a rich interplay between hundreds of millions of people and billions of pages of 
web content. This represents a social network unprecedented in history. We begin 
by reviewing several empirical observations on this network, followed by a listing 
of web search and mining algorithms built on the network. We give pointers to 
models of the web as a random graph, and finish with a brief discussion of how 
these phenomena differ in enterprises. 

2 Structure in the Web Graph 

The web may be viewed as a directed graph in which each vertex is a static html 
web page, and each edge is a hyperlink from one web page to another. Current 
estimates suggest that this graph has well over a billion vertices, and an average 
degree of about 7. We begin by reviewing a series of empirical observations on 
the web graph m- 

A striking, fairly consistent observation on the web is the distribution of 
vertex in-degrees: they follow a power law ra, with the fraction of web pages 
having in-degree k being proportional to k ~ 21 across a variety of measurements 
by various researchers. Other power laws observed on the social network of the 
web include the frequency of visits to websites. Clearly these statistics would not 
be observed if the actions of various users (which sites they visit, which pages 
they link to) are statistically independent. These and other phenomena (e.g., 
the connectivity structure of the web 0) demand the development of new graph 
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models. Work to date in this area EES seems to suggest that random copying 
is a phenomenon that helps explain some of the observed statistics. Another 
important aspect of the new models is that they capture an evolving graph in 
which new nodes and edges appear over time. 

3 Exploiting the Structure 

Perhaps the best know use of the link structure of the web is the google search 
engine j2j (www.google.com). The google engine uses the link structure of the 
web to induce a total ordering on the pages of the web, assigning to each page 
a number called the PageRank. A different approach to using link structure for 
web search is the HITS method of Kleinberg m- Here the ordering of pages 
is topic-dependant, and invokes a particular insight into the way web content 
creators behave: they congregate into groups of self-styled experts on a topic. 
Interestingly, this phenomenon is also captured by the new random graph models 
that embody random copying PI- These ideas have their origins in work from 
citation analysis lblSl9| . Finally, we mention that it is possible to enumerate 
these congregations of experts, mining the web for interesting topics El and 
other structures. 

4 Social Networks in the Enterprise 

The web has been a tremendously exciting and instructive test bed for experi- 
menting with ideas from social networks. However, most of the content on the 
web is, we believe, “low quality” - it consists primarily of unfiltered market- 
ing information and annotation. The truly valuable unstructured information 
resides inside enterprises (by “enterprise”, we mean not just profit-making cor- 
porations, but also governments, academic institutions, and so on). The very 
fact that these enterprises limit access to their information means they place a 
significant value on this information. How can we use ideas from social networks 
to these settings? 

Preliminary investigations suggest that a number of the ideas that are honed 
to the web will have to be adapted to the enterprise. For instance, the structure 
of hyperlinks in the enterprise is likely to be very different from the web: content 
creation in enterprises is typically less free-form and more templated, with a 
lot more navigational links and a lot fewer annotative/expressive links. As for 
connectivity, it appears that a large enterprise such as IBM may behave more 
or less like the web, but small enterprises are likely to be very different. We are 
performing more experiments on this front, and only a detailed study will reveal 
how the scope and scale of an enterprise affect its link structure. 

Beyond hyperlinks, there is a great deal more to the social network of an 
enterprise: what documents do employees tend to access, how are these related 
to their roles in the enterprise, and what navigational assistance can an enterprise 
portal derive from these patterns? The challenge ahead, we believe, is to tap into 
these patterns and use them to help search, navigate, classify and personalize 
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information access within the enterprise, much as we have had successes on these 
fronts in the web. 
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Abstract. This research deals with state-of-the-art novel ideas in high 
level visualization, understanding and interpretation of line-drawing 
images of 3D patterns, including articulated objects. A new structural 
approach using linear combination and fast two-pass parallel matching 
techniques is presented. It is aimed at learning, representing, visualizing, 
and interpreting 2D line drawings as 3D objects, with only very few 
learning samples. It solves one of the basic concerns in diffusion to- 
mography complexities, i.e. patterns can be reconstructed through fewer 
projections, and 3D objects can be recognized by a few learning samples 
views. It can also strengthen advantages of current key methods while 
overcome their drawbacks. Furthermore, it will be able to distinguish 
objects with very similar properties and is more accurate than other 
methods in the literature. In addition, an expsrimental system using 
JAVA and user-friendly interative platform has been established for 
testing large volume of image data in virtual environment on the web, 
for learning, and recognition. Several illustrative examples are demon- 
strated, including learning, recognizing, visualization and interpretation 
of 3D line drawing polyhedral objects. Finally, future research topics 
are outlined. 

Keywords: visualization and interpretation, 3D articulated objects, 
learning, pattern recognition, linear combination, diffusion tomography, 
finite representation, interactive learning, on-line virtual environment 



1 Introduction and Objectives 

Recently, there has been a growing interest in the research of 3D object recog- 
nition. Not only it is interesting and challenging, but also it can be applied 
to many fields, including industrial parts inspection, military target recognition, 
CAM/CAD engineering design, image understanding, and document image anal- 
ysis [5,11,18,22,33,41,44,45]. However, most work tackled rigid objects only, with 
large learning sample size, low speed, and can not distinguish similar patterns 
[2,25,44]. Recently, with rapid development of modern computer technologies, 
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there is an increasing interest and need to recognize articulated objects. Many 
3D objects including military targets fall into this category. In this research, a 
novel strategy is proposed aimed at faster, more accurate and needs fewer learn- 
ing samples for high level visualization, understanding, and interpretation of 3D 
articulated objects. 

A Brief Historical Background. Line drawing images, also known as “no- 
ble class of images” [31] in which many important and essential phenomena of 
3D objects hold in the real world, provide an effective and practical method 
to describe the 3D shape of an object. These drawings normally can be orga- 
nized into two or more distinct views, known as characteristic views or aspects 
in which no one view by itself is sufficient to completely characterize the object 
being represented [20]. One of the most challenging and difficult problems is : 
how can we visualize and interpret a line drawing image of 3D object? How a 3D 
object in various rotations and topological scalings can be properly represented 
and recognized? There have been several developments in engineering drawing 
interpretation over the past twenty years. Earlier work was reported in 1971 
[12]. Later, a bottom-up approach to interpretation was developed [42], and was 
extended to include objects with curved surfaces [29]. Recently, it has applied 
the bottom-up technique to interpret paper-based drawings [24] . An alternative 
approach was presented in [43], using a rule-based method to produce a wire 
frame interpretation, but only for simple polyhedral objects. In 1988, Nagendra 
and Gujar did a comprehensive survey on various methodologies in the litera- 
ture [23]. However, all of these methods appear to require a rigidly controlled 
drawing layout as there is no mechanism to determine relative viewing direc- 
tions between the various views. Also, there is no provision for any error in the 
alignment of the views or in the location of point features other than those that 
can be accommodated by a simple distance threshold. In 1984, a method was de- 
veloped for correcting misalignments in single-view drawings with a non-special 
view point; yet it does not appear to be extensible to multiple views with special 
view points [31]. 

Almost simultaneously, there have been various methods in computer vision 
and object recognition. An interesting approach was presented for recognizing 
3D objects using a set of 72 x 72 = 5184 sample views from learning [32]. Fur- 
ther effort has reduced it to 100 views [25] . More recently, a powerful new com- 
putational method called linear combinations (LC) using alignment techniques 
was developed at MIT [2,34,35]. It involves easy computations and needs fewer 
learning sample views, in contrast to other methods that required a huge set 
of learning samples even for describing and recognizing only one single object 
[8,9,14,17], and some other methods which require more complicated primitive 
structures such as ellipsoid and bent cylinder [10]. Yet, it has many limitations 
and may misrecognize invalid objects and reject valid ones. Such difficulties were 
overcome by another approach using parallel graph matching by Wang [39] with 
several explicit rules for learning and recognition of concave rigid objects with 
very few sample views. Other recent developments and surveys can be found in 
[4,7,24]. 
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Now we will tackle a more general problem involving articulated objects. This 
class of objects is of special interests and importance since it includes most of 
the industrial robots and man made factory tools as well as military targets. For 
example, in recognizing a tank, it will involve different views of the model and 
various status/range of how far each component, e.g. cannon can rotate. How 
can such process be automated by using computer to visualize, understand and 
interpret such images? Though interesting and important, yet recognizing such 
objects by computers is more difficult and challenging. An earlier attempt to 
tackle such problems was in [6] using symbolic reasoning in ACRONYM system. 
An interpretation tree approach to deal with 2-cl objects with rotating subparts 
was introduced in [16] and extended to handle 3D articulated objects such as 
staplers [15]. In [3], different aspects in this area were reviewed and generalized 
the generalized Hough Transform (GHT) to recognize single joint articulated 
objects such as pairs of scissors. Yet, it remains to be seen how to overcome 
the added limitations of the GHT method, as usually happens when an existing 
object recognition technique is extended to handle articulated objects [3,13]. 
The method proposed in [37,40] uses very few learning samples and can handle 
articulated objects but only works for wire-frames, and needs 3D coordinates of 
the objects. 

Our main goal is to present a new approach to overcome the above men- 
tioned drawbacks, while maintaining advantages of key methods, and a system 
is constructed using JAVA for testing various 3D articulated pbjects in virtual 
environment on the web [1,14,19,21,23,27,28,36,41-45]. 

2 Linear Combination Method (LC) 

Linear combination method is based on the observation that novel views of 
objects can be expressed as linear combination of the stored views (from learn- 
ing). It identifies objects by contracting custom-tailored templates from stored 
two-dimensional image models. The template-construction procedure just adds 
together weighted coordinate values from corresponding points in the stored 
two-dimensional image models. Here, a model is a representation in which 

* an image consists of a list of feature points observed in the image 

* the model consists of several images - minimally three for polyhedra. 

An unknown object is matched with a model by comparing the points in an 
image of the unknown object with a template-like collection of points produced 
from the model. In general, an unknown object can be arbitrarilly rotated, ar- 
bitrarilly translated and even arbitrarilly scaled relative to an arbitrary original 
position. From the basis graphic knowledge, an arbitrary rotation and transla- 
tion of an object transforms the coordinate value of any point on that object 
according to the following equations: 

Xg = r xx (6)X + r yx (9)Y + r zx (9)Z + t x 

^ 9 = r xy{@)X + r V y(Q)\ + r zy (9)Z + t y 

Zg = r xz (9)X + r yz {9)Y + r zz {9)Z + t z 
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where Tij{0) ( i,j=x,y,z) is the parameter that shows how much the i coor- 
dinate of a point, before rotation, contributes to the j coordinate of the same 
point after rotation, and f s (s=x,y,z) is the parameter that is determined by how 
much the object is translated. 

Based on S. Ullman’s concept that three images , each showing four corre- 
sponding vertexes, are almost enough to determine the vertexes’ relative posi- 
tions, therefore, at least three model images are needed and these three model 
images yield the following equations relating models and unknown object coor- 
dinate values to unrotated, untranslated coordinate value, x,y,z. 

Xi 1 = r xx (6i)X + r yx (9i)Y + r zx {6/)Z + t x (9\) 

Xi 2 = r xx (9 2 )X + r yx {9 2 )Y + r zx [9 2 )Z + t x {9 2 ) 

Xi 3 = r xx (9z)X + r yx (9/)Y + r zx {9/)Z + t x {9 3 ) 

X Io = r xx (9o)X + r yx (9 0 )Y + r zx {9/)Z + t x (9o) 

These equations can be viewed as four equations in four unknowns, X, Y, Z 
and Xj 0 , and can be solved to yield Xj 0 in term of Xj 4 , Xj 2 , X / 3 and a collection 
of four constraints, 

X Io = a x Xj 1 + p x Xj 2 + lx X l3 + S x 

where a x , f3 x , "f x and 5 X are the constrants required for x-coordinate- value 
prediction, each of which can be expressed in term of r s and t s . In order to 
determine the constraints value, a few corresponding points are needed, here 
there are four constraints, therefore, four equations are needed, furthermore, 
four feature points are required in every image. 

The four equations are described as follows: 



Xp 1 i 0 = a x Xp 1 i 1 + p x Xp 1 i 2 + 'j x Xp 1 i 3 + S x 
Xp 2 i 0 = ot x Xp 2ll + (3 x Xp 2 j 2 + 7 x Xp 2 j 3 + 5 X 
Xp 3 i 0 = a x Xp 3 i 1 + (3 x Xp 3l2 + -y x Xp 3 i 3 + S x 
Xp 4 i 0 = a x Xp i i 1 + p x X Pil2 + -y x Xp 4 j 3 + S x 

The solutions of a x , /3 x ,y x and 5 X values can be used to predict the x coordinate 
value of any point in the unknown image from the corresponding points in the 
three model images. Then these predicted values can be used to compare to the 
original x coordinate values in unknown image. If the difference between them 
is less than a certain threshold, these two points match with each other and if 
all feature points match, the conclusion can be made that the unknown image 
matches the image models. 

Similarly, we can build equations for y coordinate, by using same method as 
that used for x coordinate, producing another set of constrants: a y , (3 V , "f y and 
S y to predict the y coordinate value of any point in the unknown image. 

Each input image is partitioned to two portions, i.e. the main and articulated 
portion. The rationale is to use as few projections as possible for learning and 
recognition , use graph for pattern representation , and gain faster speed by 
parallel matching. Each portion needs projections from two directions for every 
characteristic view in learning. Only one view is needed for recognition, though. 
If each portion of an input is a linear combination of some object models, 
it is accepted, else rejected. Human interactions are involved in learning/ 
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recognition. In 2D space, the values of z axis of orthographic projection of a 3D 
image are all set to 0 or a constant. A linear combination (LC) scheme uses two 
images in the same characteristic view class to see if an input is acceptable. 

If the input image is an LC of two images in the model-base, then it 
is accepted, else rejected. In a way, it is equivalent to solving simultaneous 
equations of the following two vectors : 

x = ciiXi + a 2 yi + a 3 x 2 and y = 61X1 + & 2 yi + 6 3 x 2 
We therefore propose the following method for recognizing line drawings of 
3D objects. Learning and recognition. The recognition phase consists of : 
(i)extracting of articulated portions and (ii) determining its status, both based 
on the knowledge obtained from learning. 




Fig. 2.1 Interactive object recognition/understanding(in virtual environment on web) 




Fig. 2.2 An example of hough transform of an input image (3D articulated object) 
to line drawing and skeleton. In this article, smoothed and thinned lines such as those 
shown below will be used for illustrations. 
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Fig. 2.3 An example of articulated feature extraction 



3 Some Experimental Results 

Some of the examples and experimental results are given in this section. As 
it is mentioned before, the Linear Combination Method applies to both rigid 
objects, which include both convex and concave objects, and articulated objects 
with visible and invisible hinges. This will cover a large amount of objects in 
real world and that is why this method is so useful. 

The following gives two kinds of articulated objects and shows how the LCM 
works on them according to the experimental data. 

Articulated Objects 

The objects in fig. 3.1 are kinds of closets and their doors can be rotated 
along their hinges. The three model images in (i), (ii) and (iii) are set up by 
rotating both main part and articulated part separately. Fig 3.1(iv) shows the 
different view of the same object and the following data verify that it matches 
the model images. 

The match procedures on the main part are described as follows: 

The feature points on the main parts of the model images are as belows: 
Modell: (-1.90,0.40), (1.00,-0.20), (1.70,-1.10), (0.70,-2.10) 

Model2: (-2.00,0.60), (0.70,-0.50), (1.70,-1.20), (0.30,-2.20) 

Model3: (-2.00,0.70), (0.40,-0.70), (1.70,-1.30), (-0.00,-2.40) 

and the selected feature points on the main part of this unknown image are 



(-2.0, 0.8), (0.3, -0.8), (1.7, -1.3), (-0.2, -2.4), and (-2.3, -0.8). 
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Fig. 2.4 An example of pattern matching (learning and recognition) 



After the equations are solved, the constrain values are as follows: 
a* = -1.12, p x = 1.78,7a; = 0.31,4 = 0.05 
a y = -0.32, /3 v = 0.09, 7 y = 1.21,4 = °- 03 

the predicted x and y coordinate values are -2.4030 and -0.7106 and the 
absolute differences between the predicted value and the original one are 0.1030 
and 0.0894. If we select threshold to be 0.6, both the differences are less than 
threshold, so far the main part of this unknown image matches those of the 
model images. 

The match procedures on the articulated portion are described as follows: 
The feature points on the articulated portion of the model images are as 
follows: 

Modell: (-1.90,0.40), (1.00,0.30), (2.00,1.20), (-0.80,1.30) 

Model2: (-2.00,0.60), (0.70,0.60), (2.10,1.40), (-0.40,1.40) 

Model3: (-2.00,0.70), (0.40,0.80), (2.20,1.70), (-0.10,1.60) 

and The feature points on the articulated portion of this unknown image are 

(-2.0, 0.8), (0.1, 1.2), (2.2,21), (0.0, 1.7), and (-0.95,1.0). 

After calculation, the constrain values are as follows: 

a* = 0.62,4 = -2.00, = 2.38,4 = -0.07 
a y = 0.33,4 = 0-67, 7y = 1.00,4 = 1-60 

the predicted x and y coordinate values are -1.4835 and 1.2656 and the ab- 
solute differences between the predicted value and the original one are 0.5335 
and 0.2656 . If we select threshold to be 0.6, both the differences are less than 
threshold, so far the articulated portion of this unknown image also matches 
those of the model images. 
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Fig. 3.1 Articulated object recognition, (i), (ii) and (iii) are model images of the closet, 
(iv) is another view of the same closet, and (v) is the image of a different closet 



Therefore, the conclusion can be made that this unknown image matches the 
model images. 

Figure 3.1 (v) is a closet which has the same main part as that of model 
object, however, its door is different from that of the model one: it has a various 
size. Therefore, when this input image is recognized, the main part should match 
the models and the articulated portion should not. Hence, the entire image 
will not match model images. The followings are the recognization data which 
demonstrate the above statement. 

Because the main part of this unknown image is the same as that in (iv) 
and from the above calculation, we know that it matches the main parts of the 
model images. 

Next, let us focus on the articulated portion recognition. 

The feature points on the articulated portion of this unknown image are 
(-2.0, 0.8), (-1.0, 1.0), (1.1, 1.9), (0.0, 1.7), and (-1.5, 0.9). 

After calculation, the constrain values are as follows: 
a x = 0.26, /3 X = —1.68,7a, = 2.62, <5^, = —0.61 
a y = 0.50, /3y = -0.92, j y = 0.67, 6 y = 0.40 

the predicted x value is -0.0273 and the absolute difference between the pre- 
dicted value and the original one is 1.4727 . If we select threshold to be 0.6, both 
the difference is much greater than threshold. Since the articulated portion of 
this unknown image does not match that of the model images, therefore, one 
can cluclude this unknown image does not match the model images, as stated 
before. 
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4 Discussions, Conclusions, and Future Research 

Through our interactive on-line system of testing, it is observed that the ability 
to open the lid gives us the extra possibility to test the result of the recognition. 
As in Figure [4.1], the interactive web platform has two portions for correspon- 
dence points selection: Learning on left, and target object to be recognized on 
right. The upper two and lower left objects are learning samples from various 
angles of the same object, while the lower right object is the target to be tested. 
Figure [4.1(3)] shows computedr distance values against threshhold values, and 
the Figure[4.1(4)] shows their relations on curve. 




The lid can be opened every 5 degrees. So we can use it to create a set of 
continuously changed objects. It is interesting to see how the angle of opening is 
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Fig. 4.1 (1) Correspondence points selections for learning and recognition target input 
object image, (2) 3 learning samples against 1 input object image on lowever right cor- 
ner, (3) some computed values of distance against threshhold values, (4) their relation 
curves 



related with distance. As we can imagine, the distance is growing with the angle 
of opening getting larger. The distance itself is not comparable between different 
object, so we use the characteristic threshold that just rejects. If the threshold 
is set to below this value, the objects will be accepted. Similarly, any threshold 
above this value will result in rejection. The result of it is a monolithically 
increasing function. If the angle difference is very small, it tends to accept (i.e., 
small characteristic threshold). By the time the angle difference growing, it tends 
toward rejecting. At about 35 degree, even the largest threshold can recognize 
it (i.e., reject it). This result shows how the recognition is related with the 
difference of objects. It is a continuous procedure that the distance recognized 
by computer is changed with the difference of objects. 
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It is also interesting to see that how the selection of feature points affects the 
result. Previously the affection of selection of feature points is hard to observe 
because of the lack of comparison basis, i.e. , the selection is restricted by the 
prerequisite that all points be visible. And the objects are either too different 
or too same to make selection. Here we can make selection and make change by 
demand, so it’s easy to see the change. 




Looking from one side, mountains, from the other side, ranges 
Far, near, high, low, all different 

I can hardly recognize the true face of the Lu Shan mountain 
Simply because I am in the middle of it 



Written by Su Dong-po, 1 1th Century, Translated by P. Wang, 1997 

Fig. 4.2 A famous poem by a well known Chinese poet Su Dong-po (11th Century), 
translated to English by the author. 



If opening the lid doesn’t change all the selected points (i.e., no selection on 
the lid point), then the result is the same as same object recognition. This is 
what we expected: the computer still need human to provide the knowledge of 
selections. If opening the lid changes some of the points (i.e., there are some 
points selected on the lid), then the objects can be recognized differently. If we 
select the points that are across the opening, the distance will be obvious enough 
in any threshold. The reason is that when two points are across the opening, 
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they have more relative distance than other cases, in the extreme example, in 
one object the two points are separate and in the other object these two points 
are combined together. So their relative distance will be much more than other 
selections. 

Human eyes have similar phenomenon. We can hardly tell the difference 
between two lines with length 1 inch and 1.05 inch, but we can easily tell whether 
there is a 0.05 inch line or not. 

However, there are some exceptions or occluded areas in the experiments. 
To calculate the matrix transformation, the rank of 3x3 matrix should be 3 and 
should not be degraded during the series of rotation. So the selection of feature 
points shouldn’t be in the same plane, and all the points should be visible in 
the learning images. The folders are the example of selections in the same plane. 
(Actually the points in folders are in slightly different plane, but the difference 
is very small) Their distances are large because of this reason. 

In the future, we would like to explore automatic determination of corre- 
spondence points, and overcome some difficulties of misrecognition due to poor 
choice of threshold values. An inherent difficulty of recognizing 3D objects seems 
due to its so many variations, as many as infinite, from various angles, sizes and 
statuses. Interesting enough, this basic natural phenomenon happens to coincide 
with a famous poem written by a famous Chinese poet Su Dong-po about one 
thousand years ago, as shown in Figure 4.2. 
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Abstract. We propose an adaptive cache coherence-replacement scheme for 
web proxy cache systems that is based on several criteria about the system and 
applications, with the objective of optimizing the distributed cache system per- 
formance. Our coherence-replacement scheme assigns a replacement priority 
value to each cache block according to a set of criteria to decide which block to 
remove. The goal is to provide an effective utilization of the distributed cache 
memory and a good application performance. 



1 Introduction 

Many studies have examined policies for cache replacement and cache coherence; 
however, these studies have rarely taken into account the combined effects of policies 
[2, 6]. In this paper we propose an adaptive cache coherence-replacement scheme for 
web proxy cache systems. This work is based on previous work we have done on 
cache replacement mechanisms which have shown that adaptive cache replacement 
policies improve the performance of computing systems [1]. Our approach combines 
classical coherence protocols (write-update and write-invalid protocols) and replace- 
ment policies (LRU, LFU, etc.) to optimize the overall performance (based on criteria 
such as network traffic, application execution time, data consistence, etc.). The cache 
coherence mechanism is responsible for determining whether a copy in the distributed 
cache system is stale or valid. At the same time, it must update the invalid copies 
when a given site requires a block. Because a cache has a fixed amount of storage, 
when this storage space becomes full, the cache must choose a set of objects (or a set 
of victim blocks) to evict to make room for newly requested objects/blocks. The re- 
placement mechanism is used for this task. Our approach attempts to improve the 
performance of the distributed cache memory system by assigning a replacement 
priority value to each cache block according to a set of criteria to select the 
block/object to remove. To fix this priority, we take into account the state of the cache 
block. In addition, our scheme uses an adaptive replacement strategy that looks at the 
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information available to make the decision what replacement technique to use, without 
a proportional increase in the space/time requirements. 



2 Theoretical Aspects 

2.1 Coherence Problem 

Distributed cache systems provide decreased latency at a cost: every cache will some- 
times provide users with stale pages. Every local cache must somehow update pages 
in its cache so that it can give users pages which are as fresh as possible. Indeed, the 
problem of keeping cached pages up to date is not new to cache systems: after all, the 
cache is really just an enormous distributed file system, and distributed file systems 
have been with us for years. In conventional distributed systems terminology, the 
problem of updating cached pages is called coherence [2, 3, 5, 6, 8, 11, 14], Specifi- 
cally, the cache coherence problem consists of keeping a data element found in several 
caches current with each other and with the value in main memory (or local memo- 
ries). A cache coherence protocol ensures the data consistency of the system: the 
value returned by a read must always be the last value written to that location. There 
are two classes of cache coherence protocols [14]: write-invalidate and write-update. 
In a write-invalidate protocol, a write request to a block invalidates all other shared 
copies of that block. If a processor issues a read request to a block that has been in- 
validated, there will be a coherence miss. In a write-update protocol on the other hand, 
each write request to shared data updates all other copies of the block, and the block 
remains shared. Although there are fewer read misses for a write-update protocol, the 
write traffic on the bus is often so much higher that the overall performance is de- 
creased. A variety of mechanisms have been proposed for solving the cache coherence 
problem. The optimal solution for a multiprocessor system depends on several factors, 
such as the size of the system (i.e., the number of processors), etc. 



2.2 Replacement Policy Problem 

A replacement policy specifies which block should be removed when a new block 
must be entered into an already full cache; it should be chosen so as to ensure that 
blocks likely to be referenced in the near future are retained in the cache. The choice 
of replacement policy is one of the most critical cache design issues and has a signifi- 
cant impact on the overall system performance. Common replacement algorithms used 
with such caches are [1, 4, 7, 9, 10, 15]: 

• First In-First Out (FIFO): this is the simplest scheme; it is easily managed with a 
FIFO queue. When a replacement is necessary the first block entered at the cache 
memory (at the head of the queue) must be removed. 
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Most Recent Used (MRU): Replaces the block in the cache, which has been more 
recently used. This is not used frequently on cache memory system because it has 
bad temporal locality. It is a typical property of the memory reference patterns of 
processors, page reference patterns in virtual memory patterns, etc. 

Least Recently Used (LRU): Replaces/evicts the block/object in the cache that has 
not been used for the longest period of time. The basic premise is that blocks that 
have been referenced in the recent past will likely be referenced again in the near 
future (temporal locality). This policy works well when there is a high temporal 
locality of references in the workload. There is a variant, called Early Eviction 
LRU (EELRU), proposed in [7], EELRU performs LRU replacement by default 
but diverges from LRU and evicts pages early when it notes that too many pages 
are being touched in a roughly cyclic pattern that is larger than the main memory. 
Least Frequently Used (LFU): It is based on the frequency with which a block is 
accessed. LFU requires that a references count be maintained for each block in 
the cache. A block/object’s referenced count is incremented by one with each ref- 
erence to it. When a replacement is necessary, the LFU replaces/evicts the 
blocks/objects with the lowest reference count. The motivation for LFU and other 
frequency based algorithms is that the reference count can be used as an estimate 
of the probability of a block being referenced. In [7], Lee et al. show that there 
exists a spectrum of block replacement policies that subsumes both the LRU and 
LFU policies. The spectrum is formed according to how much more weight is 
given to the recent history over the older history and is referred to as the LRFU 
(Least Recently/Frequently Used) policy. 

Least Frequently Used (LFU)-Aging: The LFU policy can suffer from cache 
pollution (an effect of temporal locality): if a formerly popular object becomes 
unpopular, it will remain in the cache for a long time, preventing other newly or 
slightly less popular objects from replacing it. LFU-Aging addresses cache pollu- 
tion when it considers both a block/object’s access frequency and its age in cache. 
One solution to this is to introduce some form of reference count “aging”. The av- 
erage reference count is maintained dynamically (over all blocks currently in the 
cache). Whenever this average counts exceeds some predetermined maximum 
value (a parameter to the algorithm) every reference count is reduced. There is a 
variant, called LFU with Dynamic Aging (LFUDA), that uses dynamic aging to 
accommodate shifts in the set of popular objects. 

Greedy Dual Size (GDS): It combines temporal locality, size, and other cost in- 
formation. The algorithm assigns a cost/size value to each cache block. In the 
simplest case the cost is set to 1 to maximize the hit ratio, but costs such as la- 
tency, network bandwidth can be explored. GDS assigns a key value to each ob- 
ject. The key is computed as the object’s reference count plus the cost information 
divided by its size. The algorithm takes into account recency for a block by in- 
flating the key value (cost/size value) for an accessed block by the least value of 
currently cached blocks. The GDS -aging version adds the cache age factor to the 
key factor. By adding the cache age factor, it limits the influence of previously 
popular documents. The algorithm is simple to implement with a priority queue. 
There are several variations of the GDS algorithm each of which takes into ac- 
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count coherency information and the expiration time of the cache ( GDSlifetime ). 
The second variation uses the observation that different types of applications 
change their references at different rates ( GDStype ). A last GDS variation is 
GDSlatency, which uses as key value for an object the quantity latency/size where 
latency is the measured delay for the last retrieval of the object. 

• Frequency Based Replacement (FBR): This is a hybrid replacement policy, at- 
tempting to capture the benefits of both LRU and LFU without the associated 
drawbacks. FBR maintains the LRU ordering of all blocks in the cache, but the 
replacement decision is primarily based upon the frequency count. To accomplish 
this, FBR divides the cache into three partitions: a new partition, a middle parti- 
tion and an old partition. The new partition contains the most recent used blocks 
(MRU) and the old partition the LRU blocks. The middle section consists of those 
blocks not in either the new or the old section. When a reference occurs to a block 
in the new section, its reference count is not incremented. References to the mid- 
dle and old sections do cause the reference counts to be incremented. When a 
block must be chosen for replacement, FBR chooses the block with the lowest 
reference count, but only among those blocks that are in the old section. 

• Priority Cache (PC): Uses both runtime and compile-time information to select a 
block for replacement. PC associates a data priority bit with each cache block. 
The compiler, through two additional bits associated with each memory access in- 
struction, assigns priorities. These two bits indicate whether the data priority bit 
should be set as well as the priority of the block, i.e., low or high. The cache 
block with the lowest priority is the one to be replaced. 

In general, the policies anticipate future memory references by looking at the past 
behavior of the programs (program’s memory access patterns). Their job is to identify 
a line/block (containing memory references) which should be thrown away in order to 
make room for the newly referenced line that experienced a miss in the cache. 



3 An Adaptive Coherence-Replacement Policy 

The growth of the Internet and the WWW has significantly increased the amount of 
online information and services available. However, the client/server architecture 
employed by the current Web-based services is inherently unscalable. Web caches 
have been proposed as a solution to the scalability problem [4, 5, 6, 8, 12, 16]. Web 
caches store copies of previously retrieved objects to avoid transferring those objects 
in response to subsequent requests. Web caches are located throughout the Internet, 
from the user's browser cache through local proxy caches and backbone caches, to the 
so-called reverse proxy caches located near the origin of the content. Client browsers 
may be configured to connect to a proxy server, which then forwards the request on 
behalf of the client. All Web caches must try to keep cached pages up to date with the 
master copies of those pages, to avoid returning stale pages to users. There are strong 
benefits for the proxy to cache popular requests locally. Users will receive cached 
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documents more quickly. Additionally, the organization reduces the amount of traffic 
imposed on its wide-area Internet connection. 

Because a cache server has a fixed amount of storage, the server needs a cache re- 
placement mechanism [4, 6]. Recent studies on web workload have shown tremendous 
breadth and turnover in the popular object set-the set of objects that are currently be- 
ing accessed by users [16]. The popular object set can change when new objects are 
published, such as news stories or sports scores, which replace previously popular 
objects. We should define cache replacement policies based on this workload charac- 
terization. In addition, a cache must determine if it can service a request, and if so, if 
each object it provides is fresh. This is a typical question to be solve with a cache 
coherence mechanism. If the object is fresh, the cache provides it directly, if not, the 
cache requests the object from its origin server. 

Our adaptive coherence-replacement mechanism for Web caches is based on sys- 
tems like Squid [13], which caches Internet data. It does this by accepting requests for 
objects that people want to download and by processing their requests at their sites. In 
other words, if users want to download a web page, they ask Squid to get the page for 
them. Then Squid connects to the remote server and requests the page. It then trans- 
parently streams the data through itself to the client machine, but at the same time 
keeps a copy. The next time someone wants that same page. Squid simply reads it 
from its disks, transferring the data to the client machine almost immediately (Internet 
caching). Normally, in Internet caching cache hierarchies are used. The Internet Cache 
Protocol (ICP) describes the cache hierarchies. The ICP’s role is to provide a quick 
and efficient method of intercache communication, offering a mechanism for estab- 
lishing complex cache hierarchies. ICP allows one cache to ask another if it has a valid 
copy of a object. Squid ICP is based on the following procedure [13]: 

1. Squid sends an ICP query message to its neighbors (URL requested) 

2. Each neighbor receives its ICP query and looks up the URL in its own cache. If a 
valid copy exists, the cache sends ICP_HIT, otherwise ICP_MISS 

3. The querying cache collects the ICP replies from its peers. If the cache receives 
several ICP_HIT replies from its peers (neighbors), it chooses the peer whose re- 
ply was the first to arrive in order to receive the object. If all replies are 
ICP_MISS, Squid forwards the request to the neighbors of its neighbors, until to 
find a valid copy. 

Neighbors refer to other caches in a hierarchy (a parent cache, a sibling cache or the 
origin server). Squid offers numerous modifications to this mechanism, for example: 

Send ICP queries to some neighbors and not to others 

Include the origin sever in the ICP "ping" so that if the origin servers reply arrives 
before any ICP-hits, the request is forward there directly. 

Disallow or require the use of some peers for certain requests. 

In this case, each cache block is in the following state: 

Invalid: a stale copy. 
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Normally, there is only one state because the users typically do not write. Then, the 
adaptive cache coherence-replacement mechanism is as follows: 

1 . If read miss then 

1.1 Search for a valid copy (using the ICP). A read-miss request is sent using 
the ICP 

1.2 If cache is full, choose a replacement policy according to a decision system 

1.3 Receive a valid copy 

1 .4 Read block 

2. If read hit then 
2. 1 Read block 

3.1 The Replacement System 

Normally, user cache access patterns affect cache replacement decisions while block 
characteristics affect cache coherency decisions. Therefore, it is reasonable to consider 
replacing cache blocks that have expired or are closed to expiring because their next 
access will result in an invalidation message. In this way, we propose a cache coher- 
ence-replacement mechanism that incorporates the state information into an adaptive 
replacement policy. The basic idea behind the proposed mechanism is to combine a 
coherence mechanism with our adaptive cache replacement algorithm [1, 2], Our 
adaptive cache coherence-replacement mechanism exploits semantic information 
about the expected or observed access behavior of particular data shared objects on the 
size of the cache items, and the replacement phase employs several different mecha- 
nisms, each one appropriate for a different situation. Since our coherence-replacement 
is provided in software, we expect the overhead of providing our mechanism to be 
offset by the increase in performance that such a mechanism will provide. That is, in 
our approach we examine if the overall performance can be improved by considering 
coherency issues as part of the cache replacement decision. We incorporate the addi- 
tional information about a program’s characteristics, which is available in the form of 
the cache block states, in our replacement system. Thus, we define a set of parameters 
that we can use to select the best replacement policy in a dynamic environment: 

A) Information about the system 

• Workload, Bandwidth, Latency, CPU Utilization. 

• Type of system (Shared memory, etc.) 

B) Information about the application 

• Information about the data and cache block or objects (Frequency, Age, Size, 
Length of the past information (patterns), State (invalid, shared, etc.)). 

• Type an degree of access pattern on the system (High or low spatial locality (SL), 
High or low temporal locality (TL)). 

C) Other information 

• Cache conflict resolution mechanism 

• Pre-fetching mechanism 
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An optimal cache replacement policy would know the future workload. In the real 
world, we must develop heuristics to approximate ideal behavior. For each of the 
policies we discussed in section 2.2, we list the information that is required by them: 

• LFU: reference count. 

• LRU: the program’s memory access patterns. 

• Priority Cache: information at runtime or compile time (data priority bit by 
cache/block). 

• Prediction: a summary of the entire program’s memory access pattern. 

• FBR: the program’s memory access patterns and organization of the cache mem- 
ory. 

• MRU: the program’s memory access patterns. 

• FIFO: the program’s memory access patterns. 

• GDS: size of the objects, information to calculate the cost function, reference 
count. 

• Aging approaches: GDS-aging: GDS age factor; LFU-aging: LFU age factor. 

We define one expression, called the key value, to define the priority of replace- 
ment of each block/object. According to this value, the system chooses the block with 
higher priority to replace (low key value). The key value is defined as: 

Key- Value = (CF+A+FC)/S + cache factor (1) 

where, - FC is the frequency/reference count, that is the number of times that a 
block has been referenced, 

- A is the age factor, 

- S is the size of the block/object, 

- CF is the cost function that can include costs such as latency or network 

bandwidth. 

The first part of Equation (1) is typical for the GDS, LRU and LFU policies (using 
information about objects to reference and not about cache blocks). The cache factor is 
defined according to the replacement policy used: 

• LFU: blocks with a high frequency count have the highest cache factor. 

• LRU: the least recently used block has the highest cache factor. 

• Priority Cache: defined at runtime or compile-time. 

• Prediction: the least used block in the future has the highest cache factor. 

• FBR: the least recently used block has the highest cache factor. 

• MRU: the most recently used block has the highest cache factor. 

• FIFO: the block at the head of the queue has the highest cache factor. 

• GDS: not applicable. 

• Aging approaches: FC/A, with a reset factor that restarts this value after a given 
number of ages or when the age average is more than a given value. 
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The coherence-replacement policy defines the cache factor so that: blocks in invalid 
state have the highest priority to be chosen to replace. Otherwise, blocks in shared 
states must be chosen to replace, then blocks in exclusive states, and finally, blocks in 
modified states. If there are several blocks in a particular state, we use the replacement 
policy specified in our decision system [1]. The decision system is composed of a set 
of rules to decide the replacement policy to use. Each rule selects a replacement policy 
to apply according to different criteria: 

If TL is high and the system 's memory access pattern is regular then 
Use a LRU replacement policy 

If TL is low and the system ’s memory access pattern is regular then 
Use a LFU replacement policy 

If TL is low and the system ’s memory access pattern is large then 
Use a MFU replacement policy 

If we require a precise decision using a large system’s memoiy access pattern 
history then 

Use a Prediction replacement policy 
If objects/blocks have variable sizes then 
Use a GDS replacement policy 
If a fast decision is required then 

Use a RAND replacement policy 
If there is a large number of LRU candidate blocks then 
Use a FBR replacement policy 
If SL is high then 

Use a hybrid FBR + GDS replacement policy 
If the system ’s memoiy access pattern is irregular then 
Use an age replacement policy 



4 Result Analysis 

We constructed a trace-driven simulation to study our approach using a set of client 
traces from Digital Equipment Corporation [6]. We compare our approach with [6]. 
These traces are distinguished from many proxy logs in that they contain last modifi- 
cation time. We use four evaluation criteria: response latency, bandwidth, hit rates and 
number of request. We use a normalized cost model for each of these criteria where 
each of these costs is defined 0 if a "get request" can be retrieved from the proxy 
cache, or 1 for a "get request" to a server. The total cost for a simulation is the average 
of these normalized costs. Figure 1 shows the average costs of the best policy pro- 
posed on [6] and of our work. The approach proposed on [6] has the highest cost. For 
a 10 GB cache, the cost saving is 4%. Our results indicate that for caches where the 
cache space is small, the cache replacement policy primarily determines the costs. For 
cache operating in configurations with large amounts of cache space, the cache coher- 
ency policy primarily determines the overall costs. To reduce the overhead of our 
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approach, we can make an appropriate inclusion of coherency characteristics on the 
replacement policy. 



Average 

Cost 




5 Conclusions 

The goal of this research was to formulate an overarching framework subsuming vari- 
ous cache management strategies in the context of different distributed platforms. We 
have proposed an adaptive coherence-replacement policy. Our approach includes 
additional information/factors such as frequency of block use, state of the blocks, etc., 
in replacement decisions. It takes into consideration that coherency and replacement 
decisions affect each other. This adaptive policy system has been validated by experi- 
mental work. Our majors results are: a) cache replacement and coherency are both 
important in reducing the costs for a proxy cache, b) direct inclusion of cache coher- 
ency issues maybe can reduce the overhead of our approach but doesn’t guarantee a 
better performance. 
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Abstract. Construct web applications to provide dynamic, personalized web 
contents with high scalability and performance is a challenge to the software 
industry in the next century. In most available solutions, load balancing and 
caching mechanisms are introduced in front of web servers to reduce workload. 
In this paper we present Content Request Markup Language (CRML), an 
enabling technique for distributed XML processing at the content level. CRML 
is a language based on emerging XML standards, XSLT and XPATH, to 
publish XML-based content over HTTP protocol. It provides hints to construct 
a distributed framework to support parallel XML-based content publishing. In 
addition, the content from databases or other sources could be cached before or 
after processing in block or page level. With the parallel content publishing and 
the caching mechanism, the CRML could provide a high performance platform 
for fully customized web service. 



1 Introduction 

While Tim Berners-Lee created the first web site to provide hypermedia information 
at the distributed environment, nobody believed that he would change the world. 
However, he did. The invention changes the way people communicate, learn, and 
even live. In the end of twenty century, the old economy was replaced by the e- 
commerce, we called it new economy, which is mostly built on the World Wide Web 
technologies. 



1.1 Building a Web Application 

The explosive growth of the World Wide Web over the last few years continues 
unabated. The Web has evolved from sites that serve static HTML pages to a global 
arena for recreation, information, and business transactions. 

Web applications use enabling technologies to make their content dynamic and to 
allow users of the system to affect business logic on the server. To support the web 
applications, web servers start to provide the extension mechanisms. Developers of 
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the web applications use these mechanisms to connect the business logics and web 
applications. The first & most popular approach is Common Gateway Interface (CGI). 

The CGI has a lot of disadvantages such as hard-to-write and forking processes, 
and some solutions are presented to the public to replace its position. The most 
famous solution is the Active Server Pages from Microsoft. It uses the simple script 
language embedded in the HTML to process the data. Besides, the Java Servlet & 
Java Server Pages on Java Platform provide similar solutions but based on the Java 
programming language. 

No matter what solution to use, the web applications always do the two jobs: 

• Retrieve the Content 

• Present the Content to the users. 

The web applications retrieve the content from databases, files, or the enterprise 
information systems, and the retrieved content will be formatted to HTML and sent to 
users. Sometimes, the update of the content will be sent to the date sources. 



1.2 Caching & Load Balancing 

When the number of Internet users is increasing, the web servers get incredible heavy 
traffic on their service. The caching is applied to reduce the load of the web server. 
Usually, the caching system is placed in front of the web server to cache each request 
with matched parameters. Such system has a big problem on the fully customized 
web-service because each user may get different page on the same content, which is 
generated by his preference. Some solutions put the cache in the back of the web 
applications, which cache the data from the databases. 

The Load Balancing is used to increase the capacity to handle the traffic. The 
common solution is using duplicated machine to provide the same service. The 
problem is that some resources may be allocated in each machine and some 
information such as session information is not easy to be migrated. 



1.3 The CRML Framework 

The CRML Framework is introduced based on the idea on the previous discussion. 
First, it separates the two job of the web application. The Retrieving of the Data is 
made by some Content Request Blocks. Second, then the XSL Transformation is used 
to present the data to the users. The strategy is “Divide and Conquer’’. The content 
used to generate a requested page is separated to several Content Request Blocks, 
which describe the data source and how to retrieve the data. When all required data is 
ready to process, the XSL Transformation is used to process the data. 

Each XML file with CRML Content Request Blocks should be saved in a file with 
the extension name “.sxml” to tell the web server process the file with CRML Engine. 
Such XML files are named “CRML Source Files”. 
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Fig. 1. Diagram for processing a CRML Source File 

The whole concept of this section could be represented in the Figure 1 . A CRML 
Source File is an XML file in user-assigned namespace with one or more Content 
Request Blocks. The CRML Engine can process the elements in the CRML’s 
namespace and generate the contents in user-assigned namespace. Besides, the 
developer of the CRML Source File could add some special Processing Instructions 
for CRML Engine. These Processing Instructions could setup some page-level 
properties for CRML Engine. 



2 The Content Request Block 

In previous section, we introduce that the core strategy of this framework is "Divide 
and Conquer". The pages requested by users are composed of some data sources. 
Each data source should be described to a Content Request Block: CRB. A CRB 
could describe a data source from database, file, XML database, and the only 
requirement is that the data should be modeled into a XML document. The 
requirement is due to the processing of the content is based on XML as a data type. 

2.1 The Data Source & Transformation 

The data sources in a requested web page are various. A page could be composed 
of the information from databases, files, or remote service provider. The following 
example access the content in the SQL databases. 

<crml : content id="query2" 

xmlns : we= "http : / /www . nctu . edu . tw / 2000/ CRML " > 

<crml : source type="sql"> 

<crml:attr name="db" value=" j dbc : db2 : sample" /> 

[! CDATA [select id, name from users]] 

< / crml : source? 

<crml : transformation apply= " table . xsl " 
cache = "before " /> 

< / crml : content? 
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The data from a SQL database should not be represented in the form of XML 
because the CRML need the result of the data in XML. The data in a previous 
example should be transformed to a XML document like following code fragment. 

<RESULT> 

<ROWxid>chchiu< / idxname>Chi -Huang Chiu< / name x /ROW > 

</RESULT> 

A XSL Transformation could be invoked after the data is acquired. The above 
example just shows the usage and capability of the data sources in CRML. Although 
the CRB is a fundamental of the CRML but the source of the CRB is independent of 
the rest processes to handle the whole document. The only requirement is XML. The 
CRML create the possibility to adapt the all kind of data sources in to the CRML if 
the processor of the CRML could handle the kind of the source. 

2.2 Caching 

In order to improve the efficiency of processing a requested page, the caching is 
required. Unlike the mechanism provide by most existing publishing framework 
which cache each page of data. The CRML provides a caching mechanism in CRB- 
level to cache the content before processing them. The concept is simple that a CRB 
should be requested with some parameters. The CRML assumes that the results of the 
CRBs with the same parameters should be the same. In this assumption, the processor 
could reduce the effort to retrieve the data if the parameters are matched. The 
following code gives an example of caching a CRB: 

<crml : content id="query2"> 

<crml: cache keys="div, dept" /> 

<crml: source type="sql"> 

<crml:attr name="db" value=" j dbc : db2 : sample" /> 

[! CDATA [select id from users where div='@{div}' and 
dept= ' @{dept } ' ] ] 

El 

In above CRB, the content is from the SQL databases and the div and dept are the 
variables passed as the parameters. The <crml:cache> tag provide the caching 
information of the block that if two requests with the same div and dept value should 
get the same result. Besides, the keys could be a time-variation variable like following 
example that could delay the request in assigned time: 

<crml : content id="query2"> 

<crml:cache name= "userquery" keys=" ~ { # {min} /5 } " /> 
<crml: source type="sql"> 

<crml:attr name="db" value=" j dbc : db2 : sample" /> 

[! CDATA [select id from users where div='@{div}' and 
dept= '@{dept} ' ] ] 



1 To reduce the space, some tags and attributes are ignored. 
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The example shows a notation of Expression Variable, which could enclose an 
expression as a variable. In such example, the variable ~{#{min}/5} represent the 
value of the numbers of minutes from 1970 divide 5. The variable has the same value 
in the same block of 5 minutes and could reduce the effort of processing. Besides, the 
name of the cached element could let the CRBs in different CRML document share 
the same cache data. The scheme is suitable for the CRB, which is accessed 
frequently, but the real-time accuracy is not pretty important. 

In fact, the cache should be invalidated when the data in the data source is changed. 
The fourth section of the paper will discuss the way of cache invalidation to insure the 
accuracy of the content. 



3 Inter- CRB Communication and CRB Merging 

The CRB is the content provider of the requested page, and the CRML processor will 
merge one or more CRBs into a complete document. A CRML document has its 
owned Document Type such as XHTML and the CRBs will replace some nodes in the 
document, which information is dynamic. The CRML not only define the CRB to 
reference the dynamic data sources but also define some Processing Instructions to 
provide some hints to processing the document. 



3.1 Variables 

The variables play an important role on generating dynamic content like some 
examples in previous section. In CRML framework, the only supported data type is 
string. The reason for this restriction is that both HTTP parameters and XML 
documents use string as their data type. The initial set of variables is from the HTTP 
request. All CRBs in the document could use the variables in the form of 
“@{variable name}”. If a CRB want to pass some information to another CRB, it 
could change the value of the variables using the “crml: variable” tag like the 
following example: 

<crml : content id="query2" 

xmlns : we= "http : / /www . nctu . edu . tw/2 000/ CRML " > 

<crml: source type="sql"> 

<crml:attr name="db" value= " j dbc : db2 : sample " /> 

[! CDATA [select name from users where id='@{id}']] 

< / crml : source? 

<crml : variable name="name" catch= "RESULT/ROW [0] /id" /> 

< / crml : content? 

The catch attribute use XPATH to specify the data to catch from the data from the 
data sources. If the result of the XPATH is a XML document, only the character part 
of the document will be used as the value of the variable. In addition to the 
substitution in CDATA section, the variables will be pass to XSLT processors and 
each CRB processor. The Expression Variable in section 2.2 is used to modify the 
variable before using them and reduce the complex of the framework. The System 
Variable started with pond symbol is used to reference some system variable like 




90 



C.-H. Chiu, K.-C. Liang, and S.-M. Yuan 



system time, which is read-only. If there is any error on retrieving the value of the 
variable, a string with zero length will be returned or set. 



3.2 Concurrency Control 

A CRB may need to wait another CRB to store its result in the variable. Another 
situation is that a CRB could not be processed safely when another CRB in the same 
CRML document is processing. Because the concurrent access to the same resources 
may lead to data inconsistency. In the first CRB in next example, the authorization is 
checked by the first CRB and the other CRB needs to wait the CRB, which do the 
authorization. 

<crml : content id="auth"> 

<crml : concunrrency threadsaf e= "no" / > 

<crml: source type="sql"> 

<crml:attr name="db" value= " j dbc : db2 : sample " /> 

[! CDATA [select name from users where id='@{id}' and 
passwords @{PASSWD} ' ] ] 

< / crml : source> 

<crml : variable name="name" catch= "RESULT/ROW [0] /id" /> 
<crml : exception code="100" type="page"> 

<?xml-stylesheet href =" error . xsl " type="xsl"?> 

<error type= " autherror " > 

The User/Password you inputted is not correct ! 

< / error> 

< / crml : exception? 

< / crml : content? 

<crml : content id= " secdata" ? 

<crml : concunrrency threadsafe="no" wait= " auth" /? 



< / crml : content? 

The threadsafe attribute in the “crmkconcurrency” tag is used to specify if the CRB 
could be processed with another CRB. The wait attribute obviously is used to specify 
the CRBs to wait before process it. The default values of “threadsafe" and "wait" are 
“yes” and an empty list. 



3.3 Exception Handling 

Each data source may throw exceptions to CRML Processor when some errors rise. 
The “crml: exception” tag is used to handle the exception from CRML Processor. In 
the designed of CRML, each exception is identified by an integer. Like previous 
example, the exception code 100 will be caught and whole requested page will be 
replaced. If the value of the “type” is “crb” not “page”, only the result of the CRB will 
be replaced. 
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When a page exception is raised and caught, all waiting CRBs will be canceled and 
the result of CRBs, which is done, will be discarded. The sequence of processing 
CRBs is undefined if the “crmlxoncurrency” tag is not specified and the first page 
exception will replace the whole result. In previous example, the waiting of 
authorization CRB could be removed since the fail of authorization will replace the 
secured data. 



4 The Caching Mechanism 

In the second section, how to cache a CRB is introduced but the assumption, that the 
content is not change, is made which is not reasonable for a dynamic web site. 
Usually, web server is just a part of the whole information system; the content may 
retrieve from database, which could be updated via other interfaces. To avoid return 
out-of-date information, we try to invalidate the entries that are updated. 



4.1 Buffer Pool 

In the CRML, each cache tag will reference to a named buffer pool, which contains a 
map from keys to XML document fragments. In the default, each buffer pool has a 
name generated by the CRML processor but the name could also be assigned by the 
CRML which may let different CRML source files access the same buffer pool. 



4.2 Self-Invalidation 

Some content has the information about modification date or expiration information, 
which could be used to invalidate the buffer. When the buffer pool is requested to 
provide the content with such information, the buffer pool will try to check the date 
between the original and the cached content to determinate that if the access to 
original is required. 

The well-known content has such capability is file. A file contains the timestamp 
of the last modification, which could be used to verify the correctness of the 
information in the buffer pool. The effort to do the self-invalidation is much less than 
opening a file and processing it. 



4.3 Event-Driven Invalidation 

Another solution to insure the real-time content is Event-driven Invalidation. The 
method based on an event-driven bus on which the other part of the whole 
information system could broadcast the event. The buffer pools adapted for the bus 
could subscribe the event to invalidate the entries in the buffer pools. 

The Fig. 2 is an example of the scenario, which the event source is a DBMS. An 
event generator could be set via the replication module on the DBMS to publish the 
event. The buffer pool will invalidate the buffer when it gets the event from the 
database, which indicates the modification of the data. 
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Fig. 2. Event-driven Invalidation 

Such passive invalidation is not real-time accuracy but useful because the buffer 
pool don’t need to check before response that may increase the response time. The 
time to trigger the modification could vary depended on the source. Usually, the 
replication module will fire the modification of the data each five to twenty minutes. 
If the applications, which modified the data could also fire the event, the time need to 
wait the correct information could less than five second. 



5 Implementation for Parallel Processing 

The CRML doesn’t include the detail information about how to process a CRML 
source file and could be vary depended on the implementation. In this section, the 
implementation for parallel processing will be discussed to show the capability of the 
CRML. 



5.1 The CRML Engine 



The CRML Engine is the component to process a whole request. The architecture of 
the Engine is shown on Figure 3. The request will be associated with a CRML source 
file and be passed to the Engine with request parameters. The engine will try to use 
the page-level cache before process the CRML source file. If the cache is missed the 
engine will start to process the file. 

First, the XSLT processor will be used if the pre-formatter or the post-formatter is 
assigned. Second, the compiler of the CRML will be invoked to compile the CRML 
source file and generate the access plan of the request. The access plan will be sent to 
the runtime and executed. Finally, if the result from the runtime contains other CRBs 
to process, the compiler will try to generate a access plan for next round. 




Fig. 3. The Architecture of the CRML Engine 
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The Access Plan 

The Access Plan is some link-lists generated from the Compiler. According to the 
“concurrency” tag of each CRB, related CRBs will be put in the same link-list and in 
correct sequence. The plan will give the CRML runtime some information about how 
to execute the whole CRML file in a correct way. 

The CRML Runtime 

Like the Fig. 4, the access plan will be dispatched to several threads to execute. The 
CRB Executor is the core of each thread that executes the CRBs by the help of CRB 
Processors and Buffer Pools. The Result Collector could collect and manage the 
results and exceptions from each thread. The semaphores in the Result Collector also 
do the concurrency control. 




Fig. 4. The internal design of a CRML Runtime 



5.2 Configuration for Distributed Environment 

The previous focus of the Parallel Processing is the multi-thread processing in a single 
machine. This is helpful to decrease the response time because retrieving the data, 
such as a SQL query, needs to wait. If the multi-thread processing is applied, the 
CRB, which doesn't need to wait other CRB, could be processed together. If each 
thread could be separated to different machine, the benefit not only reduces the 
response time but also increases the capacity of the system. 



6 Conclusions and Future Works 

In this paper, we have presented the core of CRML, a distributed framework for 
XML-based Content Publishing. Comparing with other publishing framework or 
tools, CRML: 1) stresses the distributed architecture which could be used either to do 
the load balancing for a heavy traffic website or to perform multithread processing in 
a single machine to improve performance; 2) provides the block-level and content- 
independent cache for the data before processing which is a great improvement for 
fully-customized web service; 3) uses the emerging XML standards to process the 
XML-based data without any sequential programming language. 

CRML is a part of WebEngine, a developing project of Web Content Management 
System. The concept of WCMS is like the DBMS, which both manage the data and 
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handle the request. As the role of SQL in DBMS, the CRML is designed to describe 
what user wants to request not how to handle the request. Such spirit could let 
WebEngine improve the performance using a lot of techniques without change the 
original CRML source file. Since the CRML is used to request the content, the 
transaction management is not mentioned in the first version of design, which may be 
improved as future works. 
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Abstract. Most people store ‘bookmarks’ to web pages. These allow 
the user to return to a web page later on, without having to remem- 
ber the exact URL address. People attempt to organise their bookmark 
databases by filing bookmarks under categories, themselves arranged in 
a hierarchical fashion. As the maintenance of such large repositories is 
difficult and time-consuming, a tool that automatically categorises book- 
marks is required. This paper investigates how rough set theory can help 
extract information out of this domain, for use in an experimental auto- 
matic bookmark classification system. In particular, work on rough set 
dependency degrees is applied to reduce the otherwise high dimensional- 
ity of the feature patterns used to characterize bookmarks. A compari- 
son is made between this approach to data reduction and a conventional 
entropy-based approach. 



1 Introduction 

As the use of the Web becomes more prevalent and the size of personal reposi- 
tories grows, adequately organising and managing bookmarks becomes crucial, 
somewhat analogous to the need to organise files in a private disk. Several years 
ago, in recognition of this problem, web browsers included support for tree- 
like folder structures for organising bookmarks. These enable the user to browse 
through their repository to find the necessary information. However manual URL 
classification and organisation can be difficult and tedious when there are more 
than a few bookmarks to classify - something that goes against the whole grain 
of the bookmarking concept. 

An empirical study on users’ World Wide Web page revisitation patterns 
(as reported in JQ) found that 58% of pages viewed are revisits. So over half of 
the instances where a user accesses a page, they are revisiting it (probably via 
their bookmark database). Another survey was carried out by the GVU’s WWW 
Surveying Team | 2 ] to determine which bookmarking activities are performed by 
different groups of people. Most respondents create entries (86%), delete entries 
(74%), create folders (70%) and rearrange entries (63%), with only 4% saying 
that they do not use them at all. Those creating sub-folders, however, were 
comparatively low. 



N. Zhong et al. (Eds.): WI 2001, LNAI 2198, pp. 95-|TH5| 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 
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This suggests that although people spend time creating and rearranging their 
bookmarks, the hierarchy tends to have a shallow tree-like structure. This could 
be for the following reasons: 



— Many usability studies, for example 0, indicate that a deep hierarchy results 
in less efficient information retrieval as many traversal steps are required, so 
users are more likely to make mistakes. 

— Users do not have the time/patience to arrange their collection into a well- 
ordered hierarchy. Also, if the tree has been ordered and is quite deep, it can 
take too long to traverse the sub-folders to reach the desired bookmark. 

It seems, then, that there is a need for a tool that can automatically create 
folders and sub-folders and classify bookmarks into them. Surprisingly, few such 
systems are in existence; two worth noting are the Bookmark Organiser 
|Q and PowerBookmarks jS|. However, these approaches rely on information 
other than that contained in the bookmark databases. Both applications use 
the information contained in the documents pointed to by the URLs in order to 
generate classifications. 

Many classification problems involve high dimensional descriptions of input 
features. It is therefore not surprising that much research has been done on 
dimensionality reduction. However, existing work tends to destroy the underlying 
semantics of the features after reduction (e.g. transformation-based approaches 
JSJ) or require additional information about the given data set for thresholding 
(e.g. entropy-based approaches [Zj). A technique that can reduce dimensionality 
using information contained within the data set and preserving the meaning of 
the features is clearly desirable. Rough set theory can be used as such a tool to 
discover data dependencies and reduce the number of attributes contained in a 
dataset by purely structural methods. 

The rest of this paper is structured as follows. Section 2 introduces the main 
approach to dimensionality reduction, namely Rough Set Attribute Reduction 
and also highlights the operation of an additional technique, Entropy-based Re- 
duction. The modular design of the bookmark classification system is described 
in section 3; each module involved is detailed. Section 4 presents the experi- 
mental results obtained and section 5 concludes the paper and mentions some 
important future work. 



2 Dimensionality Reduction 

The datasets generated in Information Retrieval systems tend to be extremely 
large, rendering most classifiers intractable. This results in the need for a mech- 
anism that will greatly reduce the dimensionality of these datasets, whilst re- 
taining important information. To be self-contained, this section presents those 
techniques that have been developed for this purpose. 
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2.1 Rough Set-Based Reduction 

A rough set pj is an approximation of a vague concept by a pair of precise 
concepts, called lower and upper approximations (which are a classification of the 
domain of interest into disjoint categories). The classification formally represents 
knowledge about the problem domain. Objects belonging to the same category 
characterized by the same attributes (or features) are not distinguishable. 

Central to Rough Set Attribute Reduction (RSAR) is the concept of indis- 
cernibility. Let I = ( U,A ) be an information system, where U is a non-empty 
set of finite objects (the universe). A is a non-empty finite set of attributes such 
that a : U — »• V a for every a £ A; V a is the value set for attribute a. In a decision 
system, A = {C U D} where C is the set of conditional attributes and D is the 
set of decision attributes. With any P C A there is an associated equivalence 
relation IND(P): 

IND(P) = {(x,y) £ U 2 | V a £ P a(x) = a(y)j (1) 

If (x,y) £ IND(P), then x and y are indiscernible by attributes from P. The 
equivalence classes of the P-indiscernibility relation are denoted [x]p. Let X C 
U, the P -lower approximation of a set can now be defined as: 



PX = {x | [x] P C X} 



(2) 



Let P and Q be equivalence relations over U, then the positive region can be 
defined as: 



POS P (Q)= jj PX (3) 

xeu/Q 

The positive region contains all objects of U that can be classified to classes of 
U/ Q using the knowledge in attributes P. 

An important issue in data analysis is discovering dependencies between at- 
tributes. Intuitively, a set of attributes Q depends totally on a set of attributes 
P, denoted P => Q, if all attribute values from Q are uniquely determined by 
values of attributes from P. If there exists a functional dependency between val- 
ues of Q and P, then Q depends totally on P. Dependency can be defined in the 
following way: 

For P, Q C A, Q depends on P in a degree k (0 < k < 1), denoted P =>*, Q, if 



k = 7 p(Q) 



\POSp(Q)\ 

|U| 



( 4 ) 



If k = 1 Q depends totally on P, if k < 1 Q depends partially (in a degree k) on 
P, and if k = 0 Q does not depend on P. 

By calculating the change in dependency when an attribute is removed from 
the set of considered conditional attributes, a measure of the significance of the 
attribute can be obtained. The higher the change in dependency, the more sig- 
nificant the attribute is. If the significance is 0, then the attribute is dispensible. 
More formally, given P, Q and an attribute x £ P, 



98 



R. Jensen and Q. Shen 



vp(Q,x) = 1 p{Q) -7p-{x}(Q) (5) 

The reduction of attributes is achieved by comparing equivalence relations 
generated by sets of attributes. Attributes are removed so that the reduced set 
provides the same quality of classification as the original. A reduct is defined 
as a subset R of the conditional attribute set C such that 7_r(U) = 7c (D). A 
given dataset may have many attribute reduct sets, so the set R of all reducts is 
defined as: 



R = {X:XCC, 1x (D)= 1c (D)} (6) 

The intersection of all the sets in R is called the core, the elements of which 
are those attributes that cannot be eliminated without introducing more contra- 
dictions to the dataset. In RSAR, a reduct with minimum cardinality is searched 
for; in other words an attempt is made to locate a single element of the minimal 
reduct set R min C R : 

Rmin = {X : X G R, vr e R, \X\ < |Y|} (7) 

A basic way of achieving this is to calculate the dependencies of all possible 
subsets of C. Any subset with 7 (D) = 1 is a reduct; the smallest subset with 
this property is a minimal reduct. However, for large datasets this method is 
impractical and an alternative strategy is required. 



1. i? t— {} 

2. do 

3. T <r- R 

4. Va : e (C - R) 

5- if Jru{x}(D) > 7t(D) 

6. T R U {x} 

7. R^T 

8. until 7 r (D) = 7 c(D) 

9. return R 



Fig. 1. The QuickReduct Algorithm 



The QuickReduct algorithm attempts to calculate a minimal reduct 
without exhaustively generating all possible subsets. It starts off with an empty 
set and adds in turn those attributes that result in the greatest increase in 
7p(Q), until this produces its maximum possible value for the dataset (usually 1). 
However, it has been proved that this method does not always generate a minimal 
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reduct, as ^fp(Q) is not a perfect heuristic. It does result in a close-to-minimal 
reduct, though, which is still useful in greatly reducing dataset dimensionality. 

An intuitive understanding of QuickReduct implies that, for a dimension- 
ality of n, n! evaluations of the dependency function may be performed for the 
worst-case dataset. From experimentation, the average complexity has been de- 
termined to be approximately O(n). 

2.2 Entropy-Based Reduction 

To support the comparative study of the performance of RSAR for use in 
bookmark classification, the Entropy-based Reduction (EBR) technique is sum- 
marised here. This approach is based on the entropy heuristic employed by ma- 
chine learning techniques such as ID 3 cm - a similar approach has been adopted 
in m where an entropy measure is used for ranking features. 

EBR is concerned with examining a dataset and determining those attributes 
that provide the most gain in information. The entropy of attribute A (which can 
take values ai...a m ) with respect to the conclusion C (values ci...c n ) is defined 



Using this function, the entropy of each conditional attribute appearing in 
a decision table can be calculated. The attribute with the lowest entropy is 
deemed to be the one that has the highest information gain, and so is the most 
useful determiner. By selecting only a certain number of attributes with the 
lowest entropies, a reductQ for the dataset can be constructed. Note that the 
determination of the number of attributes required to construct the reduct needs 
additional information other than given in the dataset. 

In this work, for comparison, such a number is decided on by the size of a 
reduct produced by the rough set-based approach (which is solely determined 
by the dataset itself). 

3 Bookmark Classification System Design 

The application of rough sets to the domain of text classification has been at- 
tempted previously with some success D2|> but has not yet been applied to book- 
mark classification. Bookmark databases are very information-poor, the useful 
information can only be found in the URL and title fields. Therefore, steps must 
be taken to ensure that all relevant information is used in the classification pro- 
cess, with any misleading or useless data removed. 

The sorting system developed here is modular in structure, allowing various 
sub-components to be replaced with alternative implementations if the need 
arises. The main modules are Keyword Acquisition , Dimensionality Reduction 
and Classification. 

1 The term ‘reduct’ is used loosely here. 



as: 



m 



n 




( 8 ) 
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To clarify the operation of the system, an example is included. The fol- 
lowing bookmark is one of many contained in a database under the category 
Programming / Java: 

<A HREF="http : / /java, sun . com/Performance/ "> 

Ways to Increase Java Performance</A> 



3.1 Keyword Acquisition 

In order to compare the similarity of bookmarks, a suitable representation must 
be chosen. Each bookmark is considered to be a vector where the ith element is 
the weight of term i according to some weighting method (a metric). The size of 
the vector is equal to the total number of keywords determined from the training 
documents. 

This module produces weight-term pairs given a dataset. Each encountered 
word in a URL or title field is assigned a weight according to the metric used. 
Several metrics were implemented for this purpose: 

— Boolean Existential Metric. All keywords that exist in the document are 
given a weight of 1 , those that are absent are assigned o m3. 

— Frequency Count Metric. The normalized frequency of the keywords in the 
document is used as the weight m- 

— TF-IDF. The Term Frequency-Inverse Document Frequency Metric m as- 
signs higher weights to those keywords that occur frequently in the cur- 
rent document but not in most others. It is calculated using the formula: 
w(t, i) = Fi(t) x logjj~ t where Fi(t) is the frequency of term t in document i , 
N is the number of documents in the collection, and N t is the total number 
of documents that contain t. 

For the example bookmark, the keywords {java, sun, com, performance} are 
obtained from the URL, and the keywords {ways, increase, java, performance} 
from the title field. Using the simple boolean existential metric, the vector el- 
ements relating to these keywords will each contain the value 1 , the remainder 
0 . 

The resulting sets of weight-term pairs, no matter which keyword acquisition 
metric is adopted, are large in size and need to be greatly reduced to be of any 
practical use for classification. Hence, the next step: Dimensionality Reduction. 



3.2 Dimensionality Reduction 

Given the weight-term sets, this module aims to significantly reduce their size 
whilst retaining their information content and preserving the semantics of those 
remaining keywords. As mentioned earlier, two approaches have been developed 
for this purpose, namely RSAR and EBR. Once a reduct has been calculated, 
the dataset can then be reduced by deleting those attributes that are absent 
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from the reduct. The reduced dataset is now in a form that can be used by the 
classification module. 

Returning to the example, it may be decided by this module that the term 
“com” provides little or no useful information. The column relating to this term 
is removed from the main dataset. This process is repeated for all keywords 
deemed to be information-poor. 



3.3 Classification 

This module attempts to classify a given bookmark or bookmarks using the re- 
duced keyword datasets obtained by the dimensionality reduction stage. Each 
bookmark has been transformed into a weight-term vector by the keyword ac- 
quisition process. For comparison purposes, three different inference techniques 
were implemented to perform classification: 

— Boolean Inexact Model US]- This uses Boolean matching and scoring tech- 
niques. If a term exists in a document and is also present in the corresponding 
rule, then the score for that rule is increased; the rule with the highest score 
classifies the document. 

— Vector Space Model. The vector space model H3 procedure can be divided 
in to three stages. The first stage is document indexing, where content bear- 
ing terms are extracted from the document text. The second stage is the 
weighting of the indexed terms to enhance retrieval of documents relevant 
to the user. The last stage ranks the document with respect to the query 
according to the similarity measure. The similarity measure used here is the 
cosine coefficient, which measures the angle between the rule vector and the 
query vector, and is defined as: 



Sim(X , Y ) 



|xnr| 

V\X\y/\Y\ 



(9) 



— Fuzzy Reasoner. This follows the usual approach for the construction of 
fuzzy rule-based systems m- Reasoning is carried out by the fuzzy classifier 
using the dataset generated previously. All precondition memberships are 
evaluated, and the necessary logical conjunctions integrated (using the con- 
ventional minimum operator in the present implementation of the system). 
The rule with the highest score classifies the document. 



4 Results 

A large set of bookmarks was used as the training dataset. This database was 
generated by collating various online bookmark lists into one uniform collection. 
Each bookmark is pre-classified into a relevant category (for example, “Sports” 
or “Computing/Java”). An additional testing dataset of “unseen” bookmarks 
was also compiled from online resources. 
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The experiments presented here attempt to test whether RSDR is a useful 
tool for reducing data whilst retaining the information content. Additionally, 
experiments are carried out that compare the performance of RSDR with that of 
using EBR. Random-reduct (RR) generation (i.e. generating reducts randomly) 
was also used to compare the results. This method deletes random attributes 
from the dataset, but is constrained to leave the same number of attributes 
present as the RSAR method. The results of these approaches can be found in 
table El 

The classification modules (vector space model (VSM), boolean inexact 
model (BIM) and the fuzzy reasoner (FR)) are combined in order to improve 
the accuracy of the system; each combination is investigated. 



Table 1. Comparison of Unreduced and RS-reduced classification accuracy 



Dataset 


Attributes 

(URL) 


Attributes 

(Title) 


Unreduced 

RS-reduced 


1397 

514 


1283 

424 



From table H it can be seen that using rough set theory, the amount of 
attributes was reduced to around 35%. For email classification, the average re- 
duction of attributes was 3.5 orders of magnitude. This demonstrates that there 
is much less redundancy in the original datasets for the bookmark domain, which 
is intuitive as there is much less information in a bookmark than a document. 



Table 2. Comparison of reduction strategies with unreduced dataset 



Dataset 


VSM + BIM 


VSM + FR 


FR + BIM 


Unreduced 


55.6% 


49.7% 


45.0% 


RS-reduced 


49.1% 


47.3% 


42.0% 


EBR-reduced 


50.9% 


52.7% 


43.2% 


RR-reduced 


37.3% 


34.9% 


26.3% 



A comparison of the performance of the dimensionality reduction techniques 
is presented in table 01 The table shows that the overall accuracy is poor (obvi- 
ously, the random reduction gives worst results). The main point to make here 
is that the ability of the system to classify new data depends entirely on the 
quality (and to a certain extent the quantity) of the training data. It cannot, 
in general, be expected that the RS-reduced or the EBR-reduced experiments 
should perform much better than the original unreduced dataset, which itself 
only allows a rather low classification rate. 
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In light of the fact that bookmarks contain very little useful information, the 
results are unsurprising and perhaps a little better than anticipated. As stated 
earlier, the goal is to investigate how useful rough set theory is in reducing the 
training dataset. For this, it is interesting to compare how well the rough set- 
reduced approach fares against the unreduced dataset. Consider the unreduced 
dataset results to be the optimum, the table can then be rewritten as: 



Table 3. Comparison of reduction strategies 



Dataset 


VSM + BIM 


VSM + FR 


FR + BIM 


RS-reduced 


88.3% 


95.2% 


93.3% 


EBR-reduced 


91.5% 


106% 


96.0% 


RR-reduced 


67.1% 


70.2% 


58.4% 



Viewed this way, it can be seen that EBR. has the best results for each 
classifier pair, and is in fact better than the unreduced dataset in one instance. 
This could be due to the fact that EBR selects those attributes that provide 
the largest gain in information. This process might ignore otherwise misleading 
attributes that the unreduced dataset contains. The RS-reduced dataset can be 
thought of as a smaller version of the original dataset, and so this will fall prey 
to the same mistakes. 

Importantly, the performance of the RS-reduced dataset is almost as good. 
Although a very small amount of important information may have been lost in 
the rough set reduction approach, this information loss is not significant enough 
to reduce classification accuracy significantly, while the reduction of dimension- 
ality is substantial. 

The success of EBR in generating useful reducts is a little surprising, due 
to its straightforward approach. As an alternative data reduction technique, it 
fares well against RSDR. However, with EBR a threshold needs to be specified 
beforehand. With no RSDR reducts to estimate this value, there is no method 
available for discovering the appropriate number of attributes that should ap- 
pear. Another drawback with EBR is that it cannot find more than one possible 
reduct, which is perfectly fine for applications such as this, but may not be for 
more theoretical investigations. 

5 Conclusion 

Results clearly show that rough set theory can be used to significantly reduce the 
dimensionality of the training dataset without much loss in information content. 
The measured drop in classification accuracy was between 0.6% and 4% for the 
training dataset, which is within acceptable bounds. 

The main limitation of this system is that it will only be as good as the 
training dataset itself. Ideally, a much larger database of bookmarks would have 
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been used, but this would have required far too much time. It is not known 
how long it would take the QuickReduct algorithm to find a reduct for such 
a large dataset as it takes many hours to find one for the existing training 
dataset. A related problem is how to effectively handle the dynamic aspect of 
bookmarking. Typically, a user’s collection changes gradually over time, so an 
interesting extension to this work would be to incorporate these types of changes 
into the the learning framework. 

It has already been mentioned that the QuickReduct algorithm is not al- 
ways guaranteed to find a minimal reduct. One potential solution to this problem 
is to include an N-lookahead step before choosing the next attribute. This and 
other approaches are being investigated, including the use of distinction tables 
to determine the choice of attribute. Work is also being carried out that focuses 
on improving the speed and efficiency of QuickReduct. A promising research 
area being investigated is that of fuzzifying reducts PI This could be achieved 
by fuzzifying the dependency degree (the 7 function), using fuzzy-rough sets. 
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Abstract. The pages and hyperlinks of the World Wide Web may be 
viewed as nodes and edges in a directed graph. In this paper, we propose 
a new definition of the distance between two pages, called average-clicks. 
It is based on the probability to click a link through random surfing. 
We compare the average-clicks measure to the classical measure of clicks 
between two pages, and show average-clicks fits better to the users’ in- 
tuitions of distance. 



1 Introduction 

The World Wide Web provides considerable auxiliary information on top of the 
text of the Web pages, such as its link structure. There has been a fair amount 
of recent activity on how to exploit the link structure of the Web. Kleinberg 
distinguished between two types of Web sites which pertain to a certain search 
topic: hubs and authorities. A good hub is a page that points to many good 
authorities and a good authority is a page that is pointed to by many good hubs 
|8' . The hub scores and authority scores are determined by an iterative procedure. 
The pages with the highest scores are returned as hubs and authorities for the 
search topic. 

The GooglcQ search engine uses the link structure for ranking Web pages, 
called PageRank ££] • A page has high rank if the sum of the ranks of its backlinks 
is high. And the rank of a page is divided among its forward links evenly to 
contribute to the ranks of the pages they point to. PageRank is a global ranking 
of all Web pages, regardless of their content, based solely on their location in 
the Web’s graph structure. 

Most of these works, which analyze the structure of the Web graph, assume 
the length of each link to be 1 (unit), and the clicks between two pages are 
counted to measure the distance. For example, 0 finds the bipartite core, which 
is a densely linked community consisting of a set of authorities and a set of hubs 
within 1 click, shows that two randomly chosen documents on the web are 

1 http://google.com 
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